This document follows the rep_py.Rmd which prepared the .csv which is used as a base for the research. The document shows the analysis of the Water Quality Index (WQI) in King County, WA, for the period 1970-2023.
The analysis answers two questions raised in rep_py.Rmd and provides additional insights and tools to perform exploratory data analysis.
Q1: How has WQI changed over the years? Was WQI better in the past?
Q2: How does WQI change with locators' geography? Is there any pattern such as more densely populated areas have WQI worse than sparsely populated areas? Does any of the areas such as north/south/east/west have cleaner water over the others?
The answers to those questions can be found in the Summary section.
# for exporting externally and hide the code-cells in the export file do:
# jupyter nbconvert --to html --no-input file.ipynb
import plotly.io as pio
pio.renderers.default='notebook'
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# for maps
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Instead of setting the cell to Markdown, create Markdown from withnin a code cell!
# We can just use python variable replacement syntax to make the text dynamic
from IPython.display import Markdown as md
import myutils as ut
# set rows and columns to show all of it
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# setup for seaborn plots
sns.set_style('darkgrid')
sns.set_palette('Set2')
# the style of the map
MAP_STYLE = "open-street-map"
#MAP_STYLE = "stamen-terrain"
# center of the map
MAP_CENTER_LAT = 47.5
MAP_CENTER_LNG = -122.249
# the effort to change the font size in ipython.display.Markdown so
# the fonts match in md() match the cell fonts.
# so far it does not work
from IPython.core.display import HTML
HTML("""
<style>
div.cell { /* Tunes the space between cells */
margin-top:1em;
margin-bottom:1em;
}
div.text_cell_render h1 { /* Main titles bigger, centered */
font-size: 2.2em;
line-height:1.4em;
text-align:center;
}
div.text_cell_render h2 { /* Parts names nearer from text */
margin-bottom: -0.4em;
}
div.text_cell_render { /* Customize text cells */
font-family: 'Times New Roman';
font-size:1.5em;
line-height:1.4em;
padding-left:3em;
padding-right:3em;
}
</style>
""")
#%%html
#<style type='text/css'>
#.CodeMirror{
#font-size: 20px;
#}
#</style>
This is the dataframe after processing that is our base data for analysis.
# there was a warning about mixed types in column=3 so I set the low_memory to False
# my processed data
df = pd.read_csv(ut.PATH_DATA_PROCESSED, low_memory = False)
# for keeping our max and min year of the dataframe
YEAR_MIN = min(df["WaterYear"])
YEAR_MAX = max(df["WaterYear"])
# summary of data frame
df.describe(include = "all")
| Locator | WaterYear | WQI | Month | ParameterGroup | lng | lat | MostRecentSample | SiteName | StreamName | WQI_binned | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 319326 | 319326.000000 | 319326.000000 | 319326.000000 | 319326 | 319326.000000 | 319326.000000 | 319326 | 319326 | 319326 | 319326.000000 |
| unique | 78 | NaN | NaN | NaN | 13 | NaN | NaN | 2 | 78 | 60 | NaN |
| top | 3106 | NaN | NaN | NaN | Temperature | NaN | NaN | False | Green River at Starfire Way | Green | NaN |
| freq | 6981 | NaN | NaN | NaN | 28819 | NaN | NaN | 318426 | 6981 | 25473 | NaN |
| mean | NaN | 2003.907562 | 80.837086 | 6.983127 | NaN | -122.153147 | 47.566153 | NaN | NaN | NaN | 2.602231 |
| std | NaN | 13.302435 | 25.510223 | 3.751370 | NaN | 0.143708 | 0.159583 | NaN | NaN | NaN | 0.655062 |
| min | NaN | 1970.000000 | 0.000000 | 1.000000 | NaN | -122.508600 | 47.176100 | NaN | NaN | NaN | 1.000000 |
| 25% | NaN | 1994.000000 | 75.370000 | 4.000000 | NaN | -122.233900 | 47.465500 | NaN | NaN | NaN | 2.000000 |
| 50% | NaN | 2005.000000 | 90.820000 | 7.000000 | NaN | -122.166400 | 47.601100 | NaN | NaN | NaN | 3.000000 |
| 75% | NaN | 2016.000000 | 98.310000 | 10.000000 | NaN | -122.069600 | 47.705400 | NaN | NaN | NaN | 3.000000 |
| max | NaN | 2023.000000 | 100.000000 | 13.000000 | NaN | -121.372200 | 47.811400 | NaN | NaN | NaN | 3.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 319326 entries, 0 to 319325 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Locator 319326 non-null object 1 WaterYear 319326 non-null int64 2 WQI 319326 non-null float64 3 Month 319326 non-null int64 4 ParameterGroup 319326 non-null object 5 lng 319326 non-null float64 6 lat 319326 non-null float64 7 MostRecentSample 319326 non-null bool 8 SiteName 319326 non-null object 9 StreamName 319326 non-null object 10 WQI_binned 319326 non-null int64 dtypes: bool(1), float64(3), int64(3), object(4) memory usage: 24.7+ MB
The dataframe looks like it:
df.head()
| Locator | WaterYear | WQI | Month | ParameterGroup | lng | lat | MostRecentSample | SiteName | StreamName | WQI_binned | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 311 | 1970 | 70.93 | 13 | AnnualScore | -122.2479 | 47.4655 | False | Green River at Interurban | Green | 2 |
| 1 | 311 | 1971 | 61.14 | 13 | AnnualScore | -122.2479 | 47.4655 | False | Green River at Interurban | Green | 2 |
| 2 | 311 | 1972 | 74.90 | 13 | AnnualScore | -122.2479 | 47.4655 | False | Green River at Interurban | Green | 2 |
| 3 | 311 | 1973 | 75.38 | 13 | AnnualScore | -122.2479 | 47.4655 | False | Green River at Interurban | Green | 2 |
| 4 | 311 | 1974 | 83.90 | 13 | AnnualScore | -122.2479 | 47.4655 | False | Green River at Interurban | Green | 3 |
# create a station location df
loc_df = df[['Locator', 'SiteName', 'lng', 'lat']]
loc_df = loc_df.drop_duplicates()
loc_df.head()
| Locator | SiteName | lng | lat | |
|---|---|---|---|---|
| 0 | 311 | Green River at Interurban | -122.2479 | 47.4655 |
| 49 | 317 | Springbrook Creek mouth at SW 16th St | -122.2326 | 47.4659 |
| 97 | 321 | Crisp Creek mouth at SE Green Valley Rd | -122.0671 | 47.2884 |
| 133 | 322 | Newaukum Creek mouth at 212th Way SE | -122.0562 | 47.2741 |
| 182 | 430 | Lyon Creek mouth at Bothell Way NE | -122.2778 | 47.7529 |
loc_count = len(loc_df.index)
loc_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 78 entries, 0 to 2468 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Locator 78 non-null object 1 SiteName 78 non-null object 2 lng 78 non-null float64 3 lat 78 non-null float64 dtypes: float64(2), object(2) memory usage: 3.0+ KB
md(f"Records show that there have been {loc_count} stations reporting WQI during the period {YEAR_MIN}-{YEAR_MAX}.")
Records show that there have been 78 stations reporting WQI during the period 1970-2023.
Stations locations are presented on the map below.
fig = px.density_mapbox(loc_df, lat = "lat", lon = "lng", radius = 10,
center = dict (lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG), zoom = 8, hover_name = "SiteName",
hover_data = 'Locator',
mapbox_style = MAP_STYLE,
range_color = [0, 1],
title = f"Locations of stations in King County, WA, {YEAR_MIN}-{YEAR_MAX}; {loc_count} stations.",
width = 800,
height = 900)
fig.show()
# create a station location df wrt year
loc_year_df = df[['Locator', 'SiteName', 'lng', 'lat', 'WaterYear']]
loc_year_df = loc_year_df.drop_duplicates()
loc_year_df.head()
| Locator | SiteName | lng | lat | WaterYear | |
|---|---|---|---|---|---|
| 0 | 311 | Green River at Interurban | -122.2479 | 47.4655 | 1970 |
| 1 | 311 | Green River at Interurban | -122.2479 | 47.4655 | 1971 |
| 2 | 311 | Green River at Interurban | -122.2479 | 47.4655 | 1972 |
| 3 | 311 | Green River at Interurban | -122.2479 | 47.4655 | 1973 |
| 4 | 311 | Green River at Interurban | -122.2479 | 47.4655 | 1974 |
# loc_year_df.loc[loc_year_df["WaterYear"] == 2023]
# if you do not sort it Dash might present the 'animation_frame' out of order
loc_year_df = loc_year_df.sort_values('WaterYear', ascending=True)
#loc_year_df.head()
from dash import Dash, html, dash_table, dcc, callback, Input, Output
app = Dash(__name__)
def display_animated_graph(my_df):
fig = px.density_mapbox(my_df,
lat = "lat", lon = "lng", radius = 10,
center = dict (lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG), zoom = 8, hover_name = "SiteName",
hover_data = "Locator",
mapbox_style = MAP_STYLE,
title = f"Locations of stations for a selected year in King County, WA, over the years {YEAR_MIN}-{YEAR_MAX}.",
animation_frame = "WaterYear")
return fig
app.layout = html.Div([
html.H4(children = ["Available stations in King County, WA, over years ", YEAR_MIN, "-", YEAR_MAX]),
html.Div(children=["WQI in King County, WA, ", YEAR_MIN, "-", YEAR_MAX]),
#dash_table.DataTable(data=loc_df.to_dict('records'), page_size=10),
html.Hr(),
# 'style' will change the graph to be x% of the viewport height of the browser.
# more can be found: https://css-tricks.com/fun-viewport-units/
dcc.Graph(figure = display_animated_graph(loc_year_df),
id = "anim_graph",
# style={'width': '90vw', 'height': '95vh'}
style={'width': '900px', 'height': '800px'}
)
#dcc.Loading(dcc.Graph(id = "graph"), figure = display_animated_graph(loc_year_df), type = "cube"),
# dcc.RadioItems(options=["2023", "2022"], value="2023", id="year_id"),
# dcc.Graph(figure={}, id="controls_and_graph")
])
if __name__ == '__main__':
app.run(debug=True)
# get only AnnualScore and reshape the dataframe, this df will be used
# and showed later in this document.
df_an = df[df["ParameterGroup"] == "AnnualScore"]
#loc = df.Locator.unique().tolist()
# reshape the dataframe
#pd.melt(df_annual, id_vars="Locator", value_vars=loc, var_name = "Location", value_name='WQI')
df_an = df_an.pivot_table(index="WaterYear", columns=['Locator'], values=['WQI'])
# change the names of the columns because they are in the form of (WQI, '311')
locators = [x[1] for x in df_an.columns]
df_an.columns = locators
# count the non-null values in the rows
no_stations_yearly = pd.DataFrame(df_an.count(axis=1), columns = ["StationsCount"])
#len(no_stations_yearly.columns)
# some basic descriptive stats
min_no_stations = no_stations_yearly["StationsCount"].min()
max_no_stations = no_stations_yearly["StationsCount"].max()
def get_no_stations(datafr, cond):
return datafr[ (datafr == cond)["StationsCount"]]
years_for_min = get_no_stations(no_stations_yearly, min_no_stations)
years_for_max = get_no_stations(no_stations_yearly, max_no_stations)
""" Plots the years_for_min and *_max
@in dat_fr : frame to be shown
"""
def plot_years(dat_fr):
print(dat_fr)
#fig = go.Figure(data=[go.Table(
# header=dict(values=['Year', 'No. of Stations']),
# cells=dict(values=[dat_fr.index,
# dat_fr["StationsCount"]]))
#])
#fig.update_layout(width=400, height = 300)
#fig.show()
md(f"<h5>First stations that started reporting WQI for the King County were:</h5>")
oldest_stations = loc_year_df.loc[loc_year_df["WaterYear"] == 1970]
print(oldest_stations[['WaterYear', 'Locator', 'SiteName']].to_string(index = False))
WaterYear Locator SiteName
1970 311 Green River at Interurban
1970 KTHA01 Piper's Creek
curr_count = no_stations_yearly.loc[2023, "StationsCount"]
md(f"<h5>Currently we have {curr_count} stations reporting. "
f"The service started in 1970 with two stations and it was the least number"
f" of stations so far:"
f"</h5>")
plot_years(years_for_min)
StationsCount WaterYear 1970 2
The maximum number of stations for the given period was:
plot_years(years_for_max)
StationsCount WaterYear 2021 76
Let's see how the number of stations measuring WQI was changing over the years?
def plot_no_stations_yearly(df_):
f = px.line(df_, x=df_.index, y="StationsCount", markers=True,
title=f"The number of stations measuring WQI in King County, WA, over years {YEAR_MIN}-{YEAR_MAX}.",
labels = {"WaterYear" : "Year", "StationsCount" : "No. of Stations"})
#f.update_traces(mode="markers+lines", hovertemplate=None)
#f.update_layout(hovermode="x")
f.update_xaxes(dtick=2)
f.update_yaxes(dtick=10)
return f
plot_no_stations_yearly(no_stations_yearly).show()
In general, the number of stations has been growing, although it is clear that certain years observed stations closures and/or resumptions of the service. For instance, the drops in the total number of stations servicing the region appeared in the following years:
There were also sharp increases in the number of operating stations. The significant increases in the number of stations serving the region appeared in the following years:
If we study Figure Stations Location, we can see that, in general, the number of stations has been growing, although there are interruptions and resumptions in stations' service. The figure shows that in 1972, the stations were added to the core area. In 1973, stations from the middle of the area were closed, and in 1978, there was a closure of the north stations. However, in 1979, they restored stations that were closed in 1978. In 2007, six reporting entities were added to Vashon Island. Next, there were significant closures in the eastern frontier in 2010. However, in 2011, there was a substantial expansion including deep east, North Bend, Enumclaw, and far east at South Fork Skykomish at Hgwy 2.
Finally, in 2015, more stations were added to the eastern and northern parts of the area.
This short analysis leads to a conclusion that some stations have been serving for a long period of time, while some have a shorter tenure.
In order to analyze WQI over the years, I have prepared a more appropriate dataframe that focuses on the annual WQI. The following section provides some info about this slice.
Below, the basic info about the dataframe is presented.
df_an.info()
<class 'pandas.core.frame.DataFrame'> Index: 54 entries, 1970 to 2023 Data columns (total 78 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 0450CC 16 non-null float64 1 0484A 3 non-null float64 2 3106 53 non-null float64 3 311 49 non-null float64 4 317 48 non-null float64 5 321 36 non-null float64 6 322 49 non-null float64 7 430 51 non-null float64 8 434 50 non-null float64 9 438 30 non-null float64 10 440 48 non-null float64 11 442 45 non-null float64 12 444 44 non-null float64 13 446 48 non-null float64 14 470 51 non-null float64 15 474 51 non-null float64 16 478 50 non-null float64 17 484 47 non-null float64 18 486 48 non-null float64 19 631 48 non-null float64 20 632 46 non-null float64 21 A315 47 non-null float64 22 A319 45 non-null float64 23 A320 46 non-null float64 24 A432 44 non-null float64 25 A438 44 non-null float64 26 A456 42 non-null float64 27 A499 42 non-null float64 28 A617 26 non-null float64 29 A620 29 non-null float64 30 A631 45 non-null float64 31 A670 11 non-null float64 32 A680 31 non-null float64 33 A685 25 non-null float64 34 A687 5 non-null float64 35 A690 26 non-null float64 36 AMES_1 15 non-null float64 37 B319 50 non-null float64 38 B484 46 non-null float64 39 B499 7 non-null float64 40 BB470 20 non-null float64 41 BSE_1MUDMTNRD 13 non-null float64 42 C320 46 non-null float64 43 C370 36 non-null float64 44 C446 33 non-null float64 45 C484 45 non-null float64 46 CHERRY_1 16 non-null float64 47 D320 48 non-null float64 48 D474 34 non-null float64 49 F321 29 non-null float64 50 G320 46 non-null float64 51 GRIFFIN 13 non-null float64 52 HARRIS_1 15 non-null float64 53 J484 36 non-null float64 54 KSHZ06 37 non-null float64 55 KTHA01 54 non-null float64 56 KTHA02 32 non-null float64 57 KTHA03 34 non-null float64 58 LSIN1 23 non-null float64 59 LSIN9 22 non-null float64 60 MFK_SNQ 13 non-null float64 61 N484 45 non-null float64 62 NFK_SNQ 13 non-null float64 63 PATTER_3 15 non-null float64 64 RAGING_MTH 13 non-null float64 65 S478 18 non-null float64 66 S484 38 non-null float64 67 SFK_SNQ 13 non-null float64 68 SKYKOMISH 13 non-null float64 69 SNQDUVALL 13 non-null float64 70 TOLT_MTH 13 non-null float64 71 VA12A 16 non-null float64 72 VA37A 12 non-null float64 73 VA41A 16 non-null float64 74 VA42A 17 non-null float64 75 VA45A 16 non-null float64 76 VA65A 15 non-null float64 77 X630 24 non-null float64 dtypes: float64(78) memory usage: 35.4 KB
This is how a few rows of the sliced dataframe look like:
df_an.head()
| 0450CC | 0484A | 3106 | 311 | 317 | 321 | 322 | 430 | 434 | 438 | 440 | 442 | 444 | 446 | 470 | 474 | 478 | 484 | 486 | 631 | 632 | A315 | A319 | A320 | A432 | A438 | A456 | A499 | A617 | A620 | A631 | A670 | A680 | A685 | A687 | A690 | AMES_1 | B319 | B484 | B499 | BB470 | BSE_1MUDMTNRD | C320 | C370 | C446 | C484 | CHERRY_1 | D320 | D474 | F321 | G320 | GRIFFIN | HARRIS_1 | J484 | KSHZ06 | KTHA01 | KTHA02 | KTHA03 | LSIN1 | LSIN9 | MFK_SNQ | N484 | NFK_SNQ | PATTER_3 | RAGING_MTH | S478 | S484 | SFK_SNQ | SKYKOMISH | SNQDUVALL | TOLT_MTH | VA12A | VA37A | VA41A | VA42A | VA45A | VA65A | X630 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WaterYear | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1970 | NaN | NaN | NaN | 70.93 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 37.89 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1971 | NaN | NaN | 48.31 | 61.14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 23.36 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1972 | NaN | NaN | 67.97 | 74.90 | NaN | 83.43 | 77.38 | 80.32 | 84.46 | 88.33 | 89.09 | 95.55 | NaN | NaN | 86.95 | 87.80 | 91.22 | 90.80 | NaN | 89.5 | 88.82 | NaN | 79.75 | 89.12 | NaN | 93.45 | NaN | NaN | NaN | NaN | 83.5 | NaN | NaN | NaN | NaN | NaN | NaN | 71.03 | 87.38 | NaN | NaN | NaN | 92.2 | NaN | NaN | NaN | NaN | 91.42 | NaN | NaN | 84.40 | NaN | NaN | NaN | NaN | 47.24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1973 | NaN | NaN | 67.19 | 75.38 | 92.03 | 84.14 | 60.90 | 61.22 | NaN | NaN | NaN | NaN | NaN | NaN | 66.29 | 55.99 | 62.88 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 85.98 | 54.10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 49.08 | NaN | 67.46 | NaN | NaN | NaN | NaN | 51.35 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1974 | NaN | NaN | 87.43 | 83.90 | NaN | NaN | NaN | 37.59 | 41.80 | NaN | NaN | NaN | NaN | NaN | 39.42 | 30.67 | 38.24 | 36.28 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 51.15 | NaN | NaN | 32.02 | NaN | NaN | NaN | NaN | 63.2 | NaN | 63.21 | NaN | NaN | NaN | NaN | NaN | 64.63 | NaN | NaN | NaN | 54.61 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
The descriptive stats are presented as follows:
#%%capture
df_an.describe()
| 0450CC | 0484A | 3106 | 311 | 317 | 321 | 322 | 430 | 434 | 438 | 440 | 442 | 444 | 446 | 470 | 474 | 478 | 484 | 486 | 631 | 632 | A315 | A319 | A320 | A432 | A438 | A456 | A499 | A617 | A620 | A631 | A670 | A680 | A685 | A687 | A690 | AMES_1 | B319 | B484 | B499 | BB470 | BSE_1MUDMTNRD | C320 | C370 | C446 | C484 | CHERRY_1 | D320 | D474 | F321 | G320 | GRIFFIN | HARRIS_1 | J484 | KSHZ06 | KTHA01 | KTHA02 | KTHA03 | LSIN1 | LSIN9 | MFK_SNQ | N484 | NFK_SNQ | PATTER_3 | RAGING_MTH | S478 | S484 | SFK_SNQ | SKYKOMISH | SNQDUVALL | TOLT_MTH | VA12A | VA37A | VA41A | VA42A | VA45A | VA65A | X630 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 16.000000 | 3.000000 | 53.000000 | 49.000000 | 48.000000 | 36.000000 | 49.000000 | 51.000000 | 50.000000 | 30.000000 | 48.000000 | 45.000000 | 44.000000 | 48.000000 | 51.000000 | 51.000000 | 50.000000 | 47.000000 | 48.000000 | 48.000000 | 46.000000 | 47.000000 | 45.000000 | 46.000000 | 44.000000 | 44.000000 | 42.000000 | 42.000000 | 26.000000 | 29.000000 | 45.000000 | 11.000000 | 31.000000 | 25.000000 | 5.000000 | 26.000000 | 15.000000 | 50.000000 | 46.000000 | 7.000000 | 20.000000 | 13.000000 | 46.000000 | 36.000000 | 33.000000 | 45.000000 | 16.000000 | 48.000000 | 34.000000 | 29.000000 | 46.000000 | 13.000000 | 15.000000 | 36.000000 | 37.000000 | 54.000000 | 32.000000 | 34.000000 | 23.000000 | 22.000000 | 13.00000 | 45.000000 | 13.000000 | 15.000000 | 13.000000 | 18.000000 | 38.000000 | 13.000000 | 13.000000 | 13.000000 | 13.000000 | 16.000000 | 12.000000 | 16.000000 | 17.000000 | 16.000000 | 15.000000 | 24.000000 |
| mean | 50.641250 | 76.490000 | 55.008679 | 59.684490 | 22.909583 | 69.857222 | 40.253265 | 49.911765 | 39.699600 | 81.270000 | 66.259792 | 60.903111 | 45.731818 | 47.083542 | 55.338235 | 44.491373 | 56.934600 | 50.723404 | 60.416667 | 63.294792 | 58.173261 | 37.267660 | 75.853778 | 75.323478 | 51.456591 | 83.093182 | 44.704048 | 62.548333 | 62.356154 | 62.341034 | 74.443556 | 72.128182 | 68.190000 | 79.306000 | 77.428000 | 78.972308 | 57.828667 | 82.786800 | 56.087826 | 68.325714 | 75.032000 | 68.303077 | 82.536739 | 50.669167 | 47.975758 | 63.139333 | 87.335000 | 85.342292 | 62.207941 | 87.160690 | 64.294130 | 89.433846 | 88.794000 | 70.484444 | 44.191892 | 54.216111 | 49.646875 | 63.874118 | 63.097826 | 94.545909 | 80.21000 | 68.792222 | 85.116923 | 68.924667 | 76.843077 | 75.176111 | 54.968684 | 73.287692 | 91.270769 | 78.295385 | 87.427692 | 78.340000 | 68.805833 | 58.378125 | 64.232353 | 49.475000 | 68.108000 | 45.736250 |
| std | 13.520557 | 9.666628 | 13.229444 | 11.770567 | 20.363709 | 12.253840 | 16.985961 | 17.469803 | 15.777013 | 9.045613 | 14.695290 | 19.475125 | 12.606108 | 15.870263 | 13.683539 | 13.752416 | 15.392678 | 17.705161 | 9.471905 | 13.445372 | 14.165452 | 24.838234 | 10.250498 | 13.623310 | 16.530379 | 9.051694 | 16.759309 | 14.498191 | 14.110325 | 10.932884 | 9.707563 | 10.976939 | 10.587843 | 10.677582 | 7.327446 | 12.278895 | 12.122151 | 9.150209 | 15.077185 | 12.241487 | 7.825058 | 11.498756 | 6.153182 | 19.603376 | 16.543382 | 11.398525 | 4.870128 | 4.846396 | 15.130157 | 12.287005 | 11.933127 | 3.674873 | 4.149175 | 10.840733 | 14.589828 | 15.451880 | 17.142440 | 14.007483 | 33.614609 | 5.820144 | 7.87998 | 9.137460 | 4.876263 | 9.090548 | 11.399926 | 9.759039 | 14.072949 | 11.829651 | 3.719611 | 10.628369 | 6.268454 | 11.019119 | 18.001418 | 17.678572 | 15.580920 | 20.147753 | 13.913445 | 14.073361 |
| min | 32.790000 | 69.780000 | 26.740000 | 33.840000 | 1.000000 | 46.850000 | 4.490000 | 13.000000 | 12.180000 | 57.280000 | 27.990000 | 19.290000 | 21.330000 | 8.150000 | 28.030000 | 13.420000 | 20.690000 | 7.820000 | 40.150000 | 33.770000 | 32.250000 | 1.000000 | 51.850000 | 27.190000 | 15.380000 | 57.690000 | 13.420000 | 38.070000 | 27.880000 | 36.750000 | 46.800000 | 47.630000 | 39.320000 | 43.190000 | 67.480000 | 40.590000 | 43.030000 | 58.920000 | 22.780000 | 49.720000 | 62.310000 | 51.060000 | 68.450000 | 11.070000 | 14.470000 | 41.330000 | 79.360000 | 73.350000 | 27.270000 | 57.220000 | 43.090000 | 84.170000 | 80.940000 | 33.930000 | 16.520000 | 23.360000 | 6.830000 | 18.870000 | 7.130000 | 83.470000 | 67.49000 | 49.330000 | 74.850000 | 53.480000 | 58.930000 | 54.610000 | 28.450000 | 57.460000 | 84.260000 | 54.180000 | 75.650000 | 50.540000 | 29.740000 | 29.380000 | 29.710000 | 19.780000 | 38.810000 | 19.470000 |
| 25% | 41.137500 | 70.950000 | 43.740000 | 50.600000 | 6.310000 | 60.095000 | 30.350000 | 36.700000 | 27.492500 | 76.672500 | 58.500000 | 46.010000 | 37.280000 | 34.177500 | 46.595000 | 35.745000 | 45.690000 | 36.820000 | 55.705000 | 53.997500 | 49.152500 | 14.525000 | 70.480000 | 68.132500 | 40.157500 | 77.087500 | 32.510000 | 49.950000 | 53.625000 | 57.910000 | 69.080000 | 66.225000 | 61.760000 | 74.390000 | 72.670000 | 76.347500 | 50.810000 | 78.992500 | 48.602500 | 64.445000 | 72.032500 | 59.330000 | 78.202500 | 38.867500 | 35.330000 | 55.700000 | 83.872500 | 81.990000 | 56.322500 | 76.170000 | 55.662500 | 86.000000 | 85.920000 | 66.787500 | 32.370000 | 42.482500 | 38.767500 | 58.557500 | 25.410000 | 91.042500 | 77.62000 | 64.060000 | 82.790000 | 63.720000 | 67.260000 | 67.172500 | 43.875000 | 62.200000 | 88.690000 | 75.770000 | 84.670000 | 73.042500 | 58.220000 | 41.060000 | 54.240000 | 34.512500 | 56.985000 | 38.172500 |
| 50% | 48.705000 | 72.120000 | 55.100000 | 61.140000 | 20.260000 | 72.010000 | 39.330000 | 47.200000 | 38.820000 | 83.415000 | 68.040000 | 61.210000 | 43.490000 | 47.735000 | 57.450000 | 45.720000 | 58.675000 | 53.660000 | 61.290000 | 64.695000 | 58.050000 | 37.420000 | 79.320000 | 78.475000 | 50.935000 | 84.990000 | 43.955000 | 62.845000 | 64.165000 | 63.570000 | 77.020000 | 75.000000 | 68.420000 | 77.440000 | 79.960000 | 81.490000 | 52.880000 | 85.250000 | 57.315000 | 67.060000 | 73.875000 | 70.070000 | 83.760000 | 48.840000 | 49.130000 | 63.840000 | 87.075000 | 84.845000 | 63.045000 | 93.280000 | 65.945000 | 89.930000 | 88.320000 | 71.970000 | 44.910000 | 57.835000 | 53.775000 | 63.745000 | 80.830000 | 97.840000 | 80.06000 | 70.240000 | 85.720000 | 66.570000 | 79.960000 | 77.810000 | 52.860000 | 73.480000 | 92.050000 | 78.810000 | 87.730000 | 81.165000 | 69.685000 | 62.675000 | 68.790000 | 48.225000 | 72.290000 | 44.610000 |
| 75% | 57.322500 | 79.845000 | 63.990000 | 65.760000 | 32.542500 | 79.810000 | 50.030000 | 63.740000 | 48.965000 | 87.557500 | 76.645000 | 75.580000 | 54.110000 | 59.687500 | 65.230000 | 53.180000 | 69.287500 | 61.020000 | 64.887500 | 72.535000 | 66.837500 | 51.930000 | 82.930000 | 85.367500 | 64.272500 | 90.550000 | 55.417500 | 74.222500 | 70.435000 | 69.290000 | 80.930000 | 79.550000 | 75.305000 | 87.090000 | 81.000000 | 85.945000 | 63.305000 | 89.480000 | 66.380000 | 71.305000 | 78.317500 | 74.760000 | 86.367500 | 62.347500 | 59.230000 | 68.780000 | 90.330000 | 88.922500 | 67.850000 | 95.440000 | 71.985000 | 92.460000 | 92.040000 | 76.367500 | 52.880000 | 65.787500 | 61.037500 | 72.295000 | 90.955000 | 99.692500 | 84.68000 | 74.370000 | 86.650000 | 75.875000 | 82.870000 | 83.325000 | 67.060000 | 83.360000 | 93.010000 | 87.250000 | 91.500000 | 86.235000 | 77.590000 | 72.522500 | 75.730000 | 66.945000 | 78.220000 | 53.650000 |
| max | 85.460000 | 87.570000 | 87.430000 | 86.580000 | 92.030000 | 94.050000 | 80.710000 | 88.720000 | 84.460000 | 96.160000 | 91.890000 | 95.550000 | 75.480000 | 81.370000 | 89.830000 | 87.800000 | 91.220000 | 90.800000 | 86.740000 | 89.500000 | 88.820000 | 96.140000 | 93.330000 | 92.960000 | 87.630000 | 97.480000 | 86.690000 | 88.500000 | 93.200000 | 84.490000 | 92.930000 | 86.260000 | 86.220000 | 94.240000 | 86.030000 | 94.160000 | 82.390000 | 94.740000 | 87.380000 | 90.000000 | 93.230000 | 94.120000 | 94.930000 | 90.540000 | 72.990000 | 86.160000 | 97.660000 | 95.110000 | 92.840000 | 98.690000 | 95.150000 | 96.150000 | 94.850000 | 89.570000 | 79.610000 | 82.780000 | 82.180000 | 85.210000 | 99.710000 | 99.930000 | 93.25000 | 90.430000 | 94.710000 | 86.320000 | 96.480000 | 87.280000 | 80.000000 | 92.030000 | 98.040000 | 90.310000 | 97.490000 | 92.040000 | 95.150000 | 79.200000 | 86.820000 | 86.120000 | 89.520000 | 77.700000 |
Now, I will continue to the analysis of WQI.
According to [1] WQI (Water Quality Index) stratifies water quality with respect to a degree of "concern" into three groups where the lower WQI values indicate higher concerns to improve water quality in a given area:
md(f"We can start off this analysis by plotting the value for WQI for all "
f"available years, i.e., {YEAR_MIN}-{YEAR_MAX} from all available stations."
f" I added the lines that stratify WQI values.")
We can start off this analysis by plotting the value for WQI for all available years, i.e., 1970-2023 from all available stations. I added the lines that stratify WQI values.
"""
Plot WQI lines in plotly
@param fig_ : (in/out) figure to be modified by adding two horizontal lines
"""
def add_WQI_lines(fig_):
# high concern line
fig_.add_hline(y=40, line_dash="dash", line_color="red")
#annotation_position="top", annotation_text="High concern" )
fig_.add_hline(y=80, line_dash="dash", line_color="green")
#annotation_position = "bottom", annotation_text="Low concern")
fig_.update_xaxes(dtick=2)
fig_.update_yaxes(dtick=10)
return fig_
# -----------------------
# plot with lines
# -----------------------
def plot_hlines():
plt.axhline(y=40, color='r', linestyle='dashed')
plt.axhline(y=80, color='b', linestyle='dashed')
"""Create dataframe describing stats of the columns: min, max, and median
@param dat_fr: in
"""
def crt_stat_df(dat_fr):
return pd.DataFrame(
{'min' : dat_fr.min(axis=0),
'max' : dat_fr.max(axis=0),
'median' : dat_fr.median(axis=0)}
).sort_values('median', ascending=False)
"""
In case you have more lines than the colors present in the colormap, this is
how you can construct a custom colorscale so that you get one complete sequence
instead of a cycling sequence:
@param base_color_seq (in) The base color sequence we are interested in
@param colors_count The final number of colors that you want to have in
this color scale
@return The custom color scale, e.g., px.colors.sequential.RdBu
"""
def get_custom_colorscale(base_color_seq, colors_count):
return px.colors.sample_colorscale(
colorscale = base_color_seq,
samplepoints=colors_count,
low=0.0,
high=1.0,
colortype="rgb")
"""
Prepare the figure for all WQI in a dataframe and all stations
@param df_ The dataframe that contains all that WQI information
@return f A plotly figure ready to be shown()
"""
def yearly_WQI_all_fig(df_):
#df_an.plot(title = "AnnualScore WQI for all locators in Kings County.\n"
# "According to [1] stations with WQI < 40 are of 'high concern',\n"
# "[40; 80) are of 'modern concern', >= 80 'low concern.'",
# legend=False,
# ylabel= "WQI")
#plot_hlines()
#plt.show()
# sort the columns according to median in a descending order
# get the median of the columns and sort the values so
# and then start from the highest median to the lowest
sorted_cols = (df_.median(axis = 0)).sort_values(ascending=False).index
f = px.line(df_, x = df_.index, y = sorted_cols,
labels = {"value" : "WQI",
"variable" : "Station ID"},
title = "AnnualScore WQI for all locators in Kings County, WA,"
f" for years {YEAR_MIN}-{YEAR_MAX}. The legend is sorted in a "
"descending order according to stations' WQI median. "
f"{len(df_.columns)} stations.",
height = 900,
# get the reverse colors, i.e., do for your plotly color sequence [::-1]
color_discrete_sequence=get_custom_colorscale(px.colors.sequential.RdBu[::-1], len(sorted_cols)),
markers = True
)
f.add_annotation(x='1970', y=40, text="High concern")
f.add_annotation(x='1970', y=80, text="Low concern")
# prevent from changing the scale when turning off/on locators
f.update_yaxes(range = [-5, 100])
return add_WQI_lines(f)
yearly_WQI_all_fig(df_an).show()
As you can see, the plot is difficult to read and introduces many concerns. We can see that LSIN1 has been recently having a down spike in WQI, although it in the past it had high WQI indicators. The figure also shows interruptions and resumptions of the service at some location.
Section Stations Locations gave us some overview regarding the aerial coverage and timespan of recorded WQIs.
The data we have shows interruptions, additions, as well as resumptions of stations servicing the area. This analysis is focused more on a general status of the region rather than on individual cases and as such it is important to answer the question how many records for a given station we can have to perform further analysis
The following section deals with this issue.
Let's take a look at how many reports we should expect if locators, which is another word for stations measuring WQI, lasted for the entire available period.
md(f"If we consider all available years, i.e, {YEAR_MIN} - {YEAR_MAX}, we "
f"should have {YEAR_MAX-YEAR_MIN+1} records per station.")
If we consider all available years, i.e, 1970 - 2023, we should have 54 records per station.
The histogram below shows frequency of available records with respect to locators.
"""Get the count of non-null values in dataframe
@param df_ (in) : data frame to be examined
@return a series with frequencies sorted in a descending manner
"""
def get_and_sort_freqs(df_):
# get the number of counts of non-null values in, sorting
# to get extra info later
return df_.count().sort_values(ascending = False)
rec_freq = get_and_sort_freqs(df_an)
""" Shows the frequency plot for series_
@param series_ What we want to present
@param title_ The title to be shown
@return the plot to be shown
"""
def plot_freqs(series_, title_):
return px.histogram(series_,
title=title_,
labels = { 'value': '# of Available Records'},
text_auto = True,
marginal='box'
).update_layout(showlegend = False)
plot_freqs(rec_freq, f"Frequency of available records for timeframe {YEAR_MIN}-{YEAR_MAX} with respect to locators.").show()
#print(rec_freq.index[0])
#print(my_locator)
my_locator = loc_df.loc[loc_df['Locator'] == rec_freq.index[0], 'SiteName']
#print(my_locator.to_string(index = False))
md(f"There is only one station with all records available, i.e., {rec_freq[0]}, "
f"available and it is *{my_locator.to_string(index = False)}*, {rec_freq.index[0]}."
"The median is 34 records.")
There is only one station with all records available, i.e., 54, available and it is Piper's Creek, KTHA01.The median is 34 records.
rec_freq
KTHA01 54 3106 53 430 51 474 51 470 51 B319 50 434 50 478 50 311 49 322 49 631 48 446 48 486 48 440 48 D320 48 317 48 A315 47 484 47 G320 46 C320 46 A320 46 B484 46 632 46 C484 45 442 45 N484 45 A631 45 A319 45 444 44 A438 44 A432 44 A499 42 A456 42 S484 38 KSHZ06 37 J484 36 C370 36 321 36 D474 34 KTHA03 34 C446 33 KTHA02 32 A680 31 438 30 F321 29 A620 29 A690 26 A617 26 A685 25 X630 24 LSIN1 23 LSIN9 22 BB470 20 S478 18 VA42A 17 VA12A 16 VA41A 16 VA45A 16 0450CC 16 CHERRY_1 16 HARRIS_1 15 VA65A 15 AMES_1 15 PATTER_3 15 MFK_SNQ 13 TOLT_MTH 13 SNQDUVALL 13 SKYKOMISH 13 SFK_SNQ 13 BSE_1MUDMTNRD 13 RAGING_MTH 13 NFK_SNQ 13 GRIFFIN 13 VA37A 12 A670 11 B499 7 A687 5 0484A 3 dtype: int64
Now, it is time to decide which stations take into account for our analysis. It will determine how many records will be available at least per station, and which years in the recorded history will be missing. In the below table is a summary of my findings based on which I decided to go with Group B, which ensures that at most 15% of records can be missing, i.e., 46 records available at least per station, the missing records can include years $\le 1978$ and/or [2010; 2014] inclusive. Please skip to Section if you do want to skip straight to the WQI analysis.
| Group | No. of Avail Recs ($\ge$) | No. of Missing Recs ($\le$) | % Missing ($\le$) | No. of Stations | Area Covered |
|---|---|---|---|---|---|
| A | 49 | 5 | 9% | 10 | Northern, Tukwila, south of Maple Valley |
| B | 46 | 8 | 15% | 23 | Eastern, central, southern |
| C | 43 | 11 | 20% | 31 | Eastern, central |
Note that $A \supseteq B \supseteq C$, as the group $A$ has the strictest constraints. Group B covers the area that is covered by group A; it also covers more eastern, central and southern parts of the region (similarly with group C).
If we take a look at Group A that offers at least 49 records, i.e., 9% records missing at most and we get 10 stations. They cover three regions: northern parts of the region, two stations at Tukwila, and south of Maple Valley at the Green River. If they miss records it might be records from 1978 and/or prior to 1978. There is one station that, in addition, misses records [2010; 2014] (inclusive).
If we consider stations with at least 46 records available, i.e., 15% records missing at most, the number of stations increases to 23, and the area coverage includes additional stations in the eastern, central and southern parts of the region. Apart from records missing reports $\le 1978$, now there are 13 stations missing at least one record within [2010; 2014] (inclusive) and there is one station missing reports also for years [2022; 2023].
If we relax a constraint to at least 43 records available per station, i.e., $\le$ 20% missing, we have 31 stations that cover even more of the central part and a little bit more of the eastern part.
# get the stations with records the largest amount of records
"""
Get the locators with the at least rec_count records available
@param rec_count how many records do we want to have
@param freqs the frequencies series, I assume the
freqs is a series with a decreasing order
@return list of locators with at least rec_count
"""
def get_locators_with_at_least(rec_count, freqs):
return [x for x in freqs.index if freqs[x] >= rec_count]
"""
Get the years that are missing
@param loc_list The list of the locators to be checked.
@param dat_fr The data frame to be checked
@return the list of missing years against dat_fr
"""
def get_missing_years(loc_list, dat_fr):
res = []
for ll in loc_list:
ss = pd.isnull(dat_fr[ll])
years = ss[ss == True].index.to_list()
if (len(years) != 0):
res.append([ll, years])
#print(f"{ll}: {years}")
return res
"""
print locators and missing years and returns the frame of locators
@param rec_count in: the least number of records
@param freqs in: the frequencies series, I assume the freqs is a series
in a decreasing order
@param dat_fr in: the data frame to be used
@ret locs - a frame of stations that has at least rec_count records available
"""
def print_locs_and_missing_years(rec_count, freqs, dat_fr):
locs = get_locators_with_at_least(rec_count, freqs)
total_rec = YEAR_MAX - YEAR_MIN + 1
perc_missing_no_of_rec = 100 - round((rec_count / total_rec) * 100)
missing_no = total_rec - rec_count
print(f"---------\nRecords avail. >= rec_count={rec_count}, "
f"<= {perc_missing_no_of_rec}% missing, "
f"missing records (at most) {missing_no}\n"
f"locators={locs}\nlen(locs) = {len(locs)}")
print(f"\nMissing years:")
missing_years = get_missing_years(locs, dat_fr)
# print a list of lists with linebreaks
print('\n'.join(map(str, missing_years)))
return locs
"""
Prepare the density mapbox figure for a location
@param dat_fr (in) a dataframe showing the locators ids satifying
a frequency condition
@param location_years_df (in) dataframe showing the years for a given locator
@ret density_mapbox figure
"""
def get_fig_density(dat_fr, location_years_df, descr=""):
# get the location of locators
l_df = location_years_df.loc[location_years_df['Locator'].isin(dat_fr)]
f = px.density_mapbox(l_df, lat = "lat", lon = "lng", radius = 10,
center = dict (lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG), zoom = 8,
hover_name = "SiteName", hover_data = "Locator",
mapbox_style = MAP_STYLE, title = descr,
width = 800, height = 900)
return f
"""Show the density mapbox for a location
@param dat_fr (in) a dataframe showing the locators ids satifying
a frequency condition
@param location_years_df (in) dataframe showing the years for a given locator
"""
def show_density(dat_fr, location_years_df, descr=""):
f = get_fig_density(dat_fr, location_years_df, descr)
f.show()
""" Show data that will help to determine which stations select
for further analysis: print some descriptive stats and
plot figures.
@param rec_count in: the least number of records
@param freqs in: the frequencies series, I assume the freqs is a series
in a decreasing order
@param dat_fr in: the data frame to be used
@ret locs - a frame of stations that has at least rec_count records available
"""
def do_loc_analysis(freqs, location_years_df, dat_fr):
# available years
avail_years = [49, 46, 43]
figs = []
for y in avail_years:
fr = print_locs_and_missing_years(y, freqs, dat_fr)
print()
figs.append(get_fig_density(fr, location_years_df,
f"Stations with at least {y} records available for "
f"{YEAR_MIN}-{YEAR_MAX}. Stations count = {len(fr)}"))
# show_density(fr, location_years_df,
# f"Stations with at least {y} records available for {YEAR_MIN}-{YEAR_MAX}.")
[f.show() for f in figs]
do_loc_analysis(rec_freq, loc_year_df, df_an)
--------- Records avail. >= rec_count=49, <= 9% missing, missing records (at most) 5 locators=['KTHA01', '3106', '430', '474', '470', 'B319', '434', '478', '311', '322'] len(locs) = 10 Missing years: ['3106', [1970]] ['430', [1970, 1971, 1978]] ['474', [1970, 1971, 1978]] ['470', [1970, 1971, 1978]] ['B319', [1970, 1971, 1974, 1975]] ['434', [1970, 1971, 1973, 1978]] ['478', [1970, 1971, 1975, 1978]] ['311', [2010, 2011, 2012, 2013, 2014]] ['322', [1970, 1971, 1974, 1975, 1976]] --------- Records avail. >= rec_count=46, <= 15% missing, missing records (at most) 8 locators=['KTHA01', '3106', '430', '474', '470', 'B319', '434', '478', '311', '322', '631', '446', '486', '440', 'D320', '317', 'A315', '484', 'G320', 'C320', 'A320', 'B484', '632'] len(locs) = 23 Missing years: ['3106', [1970]] ['430', [1970, 1971, 1978]] ['474', [1970, 1971, 1978]] ['470', [1970, 1971, 1978]] ['B319', [1970, 1971, 1974, 1975]] ['434', [1970, 1971, 1973, 1978]] ['478', [1970, 1971, 1975, 1978]] ['311', [2010, 2011, 2012, 2013, 2014]] ['322', [1970, 1971, 1974, 1975, 1976]] ['631', [1970, 1971, 1973, 1974, 1977, 1978]] ['446', [1970, 1971, 1972, 1973, 1974, 1978]] ['486', [1970, 1971, 1972, 1973, 1974, 1975]] ['440', [1970, 1971, 1973, 1974, 1975, 1976]] ['D320', [1970, 1971, 1973, 1974, 1975, 1976]] ['317', [1970, 1971, 1972, 1974, 1975, 1976]] ['A315', [1970, 1971, 1972, 1973, 1974, 1975, 1976]] ['484', [1970, 1971, 1973, 1975, 1978, 2022, 2023]] ['G320', [1970, 1971, 1974, 1975, 1976, 2010, 2011, 2012]] ['C320', [1970, 1971, 1973, 1974, 1975, 1976, 2010, 2011]] ['A320', [1970, 1971, 1973, 1974, 1975, 1976, 1977, 1978]] ['B484', [1970, 1971, 1974, 1975, 1978, 2010, 2011, 2012]] ['632', [1970, 1971, 1973, 1974, 1977, 1978, 2012, 2014]] --------- Records avail. >= rec_count=43, <= 20% missing, missing records (at most) 11 locators=['KTHA01', '3106', '430', '474', '470', 'B319', '434', '478', '311', '322', '631', '446', '486', '440', 'D320', '317', 'A315', '484', 'G320', 'C320', 'A320', 'B484', '632', 'C484', '442', 'N484', 'A631', 'A319', '444', 'A438', 'A432'] len(locs) = 31 Missing years: ['3106', [1970]] ['430', [1970, 1971, 1978]] ['474', [1970, 1971, 1978]] ['470', [1970, 1971, 1978]] ['B319', [1970, 1971, 1974, 1975]] ['434', [1970, 1971, 1973, 1978]] ['478', [1970, 1971, 1975, 1978]] ['311', [2010, 2011, 2012, 2013, 2014]] ['322', [1970, 1971, 1974, 1975, 1976]] ['631', [1970, 1971, 1973, 1974, 1977, 1978]] ['446', [1970, 1971, 1972, 1973, 1974, 1978]] ['486', [1970, 1971, 1972, 1973, 1974, 1975]] ['440', [1970, 1971, 1973, 1974, 1975, 1976]] ['D320', [1970, 1971, 1973, 1974, 1975, 1976]] ['317', [1970, 1971, 1972, 1974, 1975, 1976]] ['A315', [1970, 1971, 1972, 1973, 1974, 1975, 1976]] ['484', [1970, 1971, 1973, 1975, 1978, 2022, 2023]] ['G320', [1970, 1971, 1974, 1975, 1976, 2010, 2011, 2012]] ['C320', [1970, 1971, 1973, 1974, 1975, 1976, 2010, 2011]] ['A320', [1970, 1971, 1973, 1974, 1975, 1976, 1977, 1978]] ['B484', [1970, 1971, 1974, 1975, 1978, 2010, 2011, 2012]] ['632', [1970, 1971, 1973, 1974, 1977, 1978, 2012, 2014]] ['C484', [1970, 1971, 1972, 1973, 1975, 1978, 2010, 2011, 2012]] ['442', [1970, 1971, 1973, 1974, 1975, 1976, 2010, 2011, 2012]] ['N484', [1970, 1971, 1972, 1973, 1975, 1978, 2010, 2011, 2012]] ['A631', [1970, 1971, 1973, 1974, 1977, 1978, 2010, 2011, 2012]] ['A319', [1970, 1971, 1973, 1974, 1975, 2010, 2011, 2013, 2014]] ['444', [1970, 1971, 1972, 1973, 1974, 1975, 1976, 2010, 2011, 2012]] ['A438', [1970, 1971, 1973, 1974, 1975, 2010, 2011, 2012, 2013, 2014]] ['A432', [1970, 1971, 1972, 1973, 1974, 1975, 1978, 2010, 2011, 2012]]
In order to analyze how WQI has been changing over the years, I have chosen Group B explained in Section. In summary: there are 23 stations and each of the stations will miss at most 15% of possible records, i.e., 9, for the period 1970-2023 (for various reasons, one of them might be that a station might not exist at a given time or it had interruptions in service.) The missing records can include years $\le 1978$ and/or [2010; 2014] inclusive. The area covers stations located in eastern, central, northern, and southern parts of the region.
# df_an holds the column as locators, index as years and values as WQI
df_an.head()
| 0450CC | 0484A | 3106 | 311 | 317 | 321 | 322 | 430 | 434 | 438 | 440 | 442 | 444 | 446 | 470 | 474 | 478 | 484 | 486 | 631 | 632 | A315 | A319 | A320 | A432 | A438 | A456 | A499 | A617 | A620 | A631 | A670 | A680 | A685 | A687 | A690 | AMES_1 | B319 | B484 | B499 | BB470 | BSE_1MUDMTNRD | C320 | C370 | C446 | C484 | CHERRY_1 | D320 | D474 | F321 | G320 | GRIFFIN | HARRIS_1 | J484 | KSHZ06 | KTHA01 | KTHA02 | KTHA03 | LSIN1 | LSIN9 | MFK_SNQ | N484 | NFK_SNQ | PATTER_3 | RAGING_MTH | S478 | S484 | SFK_SNQ | SKYKOMISH | SNQDUVALL | TOLT_MTH | VA12A | VA37A | VA41A | VA42A | VA45A | VA65A | X630 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WaterYear | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1970 | NaN | NaN | NaN | 70.93 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 37.89 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1971 | NaN | NaN | 48.31 | 61.14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 23.36 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1972 | NaN | NaN | 67.97 | 74.90 | NaN | 83.43 | 77.38 | 80.32 | 84.46 | 88.33 | 89.09 | 95.55 | NaN | NaN | 86.95 | 87.80 | 91.22 | 90.80 | NaN | 89.5 | 88.82 | NaN | 79.75 | 89.12 | NaN | 93.45 | NaN | NaN | NaN | NaN | 83.5 | NaN | NaN | NaN | NaN | NaN | NaN | 71.03 | 87.38 | NaN | NaN | NaN | 92.2 | NaN | NaN | NaN | NaN | 91.42 | NaN | NaN | 84.40 | NaN | NaN | NaN | NaN | 47.24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1973 | NaN | NaN | 67.19 | 75.38 | 92.03 | 84.14 | 60.90 | 61.22 | NaN | NaN | NaN | NaN | NaN | NaN | 66.29 | 55.99 | 62.88 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 85.98 | 54.10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 49.08 | NaN | 67.46 | NaN | NaN | NaN | NaN | 51.35 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1974 | NaN | NaN | 87.43 | 83.90 | NaN | NaN | NaN | 37.59 | 41.80 | NaN | NaN | NaN | NaN | NaN | 39.42 | 30.67 | 38.24 | 36.28 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 51.15 | NaN | NaN | 32.02 | NaN | NaN | NaN | NaN | 63.2 | NaN | 63.21 | NaN | NaN | NaN | NaN | NaN | 64.63 | NaN | NaN | NaN | 54.61 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
""" get the group as a dataframe with a at least_records available
at_least_rec in: (int) The number of records at least available at the stations
freqs in: data frame with frequency of stations over the years
dat_fr in: a main data frame with all records
"""
def get_gr_df(at_least_rec, freqs, dat_fr):
# get the list list of locators/stations with at least
gr_list = get_locators_with_at_least(at_least_rec, freqs)
# this is the list of group C WQI values
wqi_gr_df = dat_fr[gr_list]
#print(gr_list)
#print(dat_fr[gr_list])
#len(df_an[gr_list].columns)
return wqi_gr_df
# our group C dataframe
# how many records at least in each station
C_GRP_COUNT=46
grp_c_df = get_gr_df(C_GRP_COUNT, rec_freq, df_an)
#type(grp_c_df)
#print(f"grp_c_df.index={grp_c_df.index}")
#l_df = loc_year_df.loc[loc_year_df['Locator'].isin(gr_c_list)]
#print(l_df)
#print(l_df[l_df["Locator"] == 'KTHA01'])
#df_an['KTHA01']
Let's see how in general WQI was performing over the years, what min/max and median were.
# get grp C stats
grp_c_stats_df = crt_stat_df(grp_c_df).reset_index()
grp_c_stats_df = grp_c_stats_df.rename(columns= {'index' : 'Locator'})
grp_c_stats_df.index.name = 'index'
grp_c_stats_df
| Locator | min | max | median | |
|---|---|---|---|---|
| index | ||||
| 0 | B319 | 58.92 | 94.74 | 85.250 |
| 1 | D320 | 73.35 | 95.11 | 84.845 |
| 2 | C320 | 68.45 | 94.93 | 83.760 |
| 3 | A320 | 27.19 | 92.96 | 78.475 |
| 4 | 440 | 27.99 | 91.89 | 68.040 |
| 5 | G320 | 43.09 | 95.15 | 65.945 |
| 6 | 631 | 33.77 | 89.50 | 64.695 |
| 7 | 486 | 40.15 | 86.74 | 61.290 |
| 8 | 311 | 33.84 | 86.58 | 61.140 |
| 9 | 478 | 20.69 | 91.22 | 58.675 |
| 10 | 632 | 32.25 | 88.82 | 58.050 |
| 11 | KTHA01 | 23.36 | 82.78 | 57.835 |
| 12 | 470 | 28.03 | 89.83 | 57.450 |
| 13 | B484 | 22.78 | 87.38 | 57.315 |
| 14 | 3106 | 26.74 | 87.43 | 55.100 |
| 15 | 484 | 7.82 | 90.80 | 53.660 |
| 16 | 446 | 8.15 | 81.37 | 47.735 |
| 17 | 430 | 13.00 | 88.72 | 47.200 |
| 18 | 474 | 13.42 | 87.80 | 45.720 |
| 19 | 322 | 4.49 | 80.71 | 39.330 |
| 20 | 434 | 12.18 | 84.46 | 38.820 |
| 21 | A315 | 1.00 | 96.14 | 37.420 |
| 22 | 317 | 1.00 | 92.03 | 20.260 |
"""Show the Group C statistics with a plot
@param grp_df_ : in : the dataframe that contains statistics
@param rec_count : in: how many records in the group
@param period : in :
"""
def plot_grp_stats(grp_df_, rec_count, period, tickangle_ = 0):
#sorted_df.grp_df_.median(axis = 0).sort_values()
#print(sorted_df)
sorted_df = grp_df_.median(axis = 0).sort_values(ascending = False)
#print(sorted_df.index)
#print(f"grp_df_=\n{grp_df_.head()}")
fig = px.box(grp_df_, y = sorted_df.index, points='all',
title=f"Descriptive stats for stations in King County, WA, with at least {rec_count}"
f" records for the period {period}.",
labels = { "value" : "WQI",
"variable" : "Stations IDs (Locators)"}
)
fig.add_hline(y=40, line_dash="dash", line_color="red")
fig.add_annotation(x='D320', y=40, text="High concern")
fig.add_hline(y=80, line_dash="dash", line_color="green")
fig.add_annotation(x='D320', y=80, text="Low concern", ay=50)
fig.update_xaxes(tickangle = tickangle_)
return fig
plot_grp_stats(grp_c_df, C_GRP_COUNT, f"{YEAR_MIN}-{YEAR_MAX}; {len(grp_c_df.columns)} stations reporting").show()
As we can see, only three stations have a median about the Low concern level, and four fall into the category of High concern, including median 21.43 for Springbrook Creek mouth at SW 16th St close to Renton (Locator 317). Majority, i.e., 16 (about 70%) stations reported values with median within the Modern concern range. It poses the question where the stations are located? Does it depends on a geographical location? The following viz attempts to shed some light on it.
"""
Prepares the figure based on several dataframes
@param df_ The main WQI frame
@param location_df The dataframe with stations' location
@param wqi_s_ The series such as median(), max(), min()
@param col_name The name of the new column that describes the series
wqi_s_
@param descr Title for the figure
@ret figure
"""
def get_fig_density_scatter(df_, location_df, wqi_s_, col_name_, descr=""):
# get the location of locators
l_df = location_df.loc[location_df['Locator'].isin(df_)].reset_index()
l_df.index.name = 'index'
# so your index starts with 0, 1, 2,
wqi_df = wqi_s_.to_frame(name=col_name_).reset_index()
wqi_df = wqi_df.rename(columns= {'index' : 'Locator', 0 : col_name_})
# otherwise the name of the index is None
wqi_df.index.name = 'index'
# merge the dataframe with wqi values into a dataframe having
# the location, stations names and stations' ids so we can
# nicely plot the clean locations
l_df = l_df.merge(wqi_df[['Locator', col_name_]], on = 'Locator')
print(f"l_df={l_df[['SiteName', 'Locator', col_name_]].sort_values(col_name_, ascending = False)}")
#print(f"wqi_df={wqi_df}")
# stratify it because the trend is difficult to observe otherwise
bins = [0, 40, 80, np.inf]
# [0;40), [40;80), [80; ...), and since I can't find
# how to change a legend, so I have to do it with the long names now
# the dataframe is not too big anyway, so adding extra bytes ...
names = ['<40 high concern', '[40; 80) modern concern', '>=80 low concern']
class_name = 'WQI_class'
# get the cut with left inclusive and right exclusive
l_df[class_name] = pd.cut(l_df[col_name_], bins, labels=names)
#print(l_df)
# this shows the continues scale but it does not show to much
# so I decided to go with scatter_mapbox
# uncomment if you want to see the continuous_scale
#f = px.density_mapbox(l_df, lat = "lat", lon = "lng", radius = 10,
# center = dict (lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG), zoom = 8,
# hover_name = "SiteName", hover_data = ["Locator", col_name_],
# mapbox_style = MAP_STYLE, title = descr,
# z = col_name_,
# color_continuous_scale='RdBu',
# width = 800, height = 900)
f = px.scatter_mapbox(l_df, lat = "lat", lon = "lng",
center = dict (lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG), zoom = 8,
hover_name = "SiteName", hover_data = ["Locator", col_name_, class_name],
mapbox_style=MAP_STYLE, title = descr,
size = col_name_,
size_max = 20,
color = class_name,
#labels = { "legend" : {"-1" : "<40", "0" : "[40; 80)", "1" : ">=80"} },
#legend = {"-1" : "<40", "0" : "[40; 80)", "1" : ">=80"},
color_discrete_map = {
names[0] : "red",
names[1] : "yellow",
names[2] : "blue"
},
category_orders = {
class_name : [names[2], names[1], names[0]]
},
width = 800, height = 900)
#f.update_layout(legend={ labels : {"-1" : "<40", "0" : "[40; 80)", "1" : ">=80"}})
f.update_layout(legend_title_text="WQI class")
return f
get_fig_density_scatter(grp_c_df,loc_df, grp_c_df.median(axis = 0),
"WQI_median",
f"Stations in King County, WA, with at least {C_GRP_COUNT} records for "
f"the period {YEAR_MIN}-{YEAR_MAX}.").show()
l_df= SiteName Locator WQI_median 17 Green River at 212th Way SE B319 85.250 20 Jenkins Creek at Kent Black Diamond Rd SE D320 84.845 19 Covington Creek at SE Auburn-Black Diamond Rd C320 83.760 16 Soos Creek mouth below Soos Creek Hatchery A320 78.475 5 May Creek mouth at Lake Washington Blvd N 440 68.040 21 Little Soos Creek at SE 272nd St G320 65.945 12 Issaquah Creek mouth at NW Sammamish Rd 631 64.695 11 Sammamish River at NE Marymoor Way 486 61.290 0 Green River at Interurban 311 61.140 9 Little Bear mouth near NE 178th St 478 58.675 13 Issaquah North Fork at mouth 632 58.050 22 Piper's Creek KTHA01 57.835 7 Swamp Creek mouth at Bothell Way NE 470 57.450 18 Evans Creek at NE Union Hill Rd B484 57.315 14 Green River at Starfire Way 3106 55.100 10 Bear Creek downstream of Redmond Way 484 53.660 6 Juanita Creek mouth at NE Juanita Dr 446 47.735 3 Lyon Creek mouth at Bothell Way NE 430 47.200 8 North Creek mouth at Sammamish River Trail 474 45.720 2 Newaukum Creek mouth at 212th Way SE 322 39.330 4 Thornton Creek mouth at Sand Point Way NE 434 38.820 15 Mill Creek near 68th Ave S and S 262nd St A315 37.420 1 Springbrook Creek mouth at SW 16th St 317 20.260
We can see that the Low concern stations are only in the south-eastern part of the region, close to Covington area:
| Station | Locator | Median WQI |
|---|---|---|
| Green River at 212th Way SE | B319 | 85.250 |
| Jenkins Creek at Kent Black Diamond Rd SE | D320 | 84.845 |
| Covington Creek at SE Auburn-Black Diamond Rd | C320 | 83.760 |
High concern stations locations are in the northern, and southern parts of the area. Surprisingly, one High concern station, namely Newaukum Creek mouth at 212th Way SE (322) with median WQI equaling 39.330 is in vicinity of the Low concern stations. The lowest median in the region belongs to Springbrook Creek mouth at SW 16th St (317), the station close to Renton, with median WQI = 20.260.
| Station | Locator | Median WQI |
|---|---|---|
| Newaukum Creek mouth at 212th Way SE | 322 | 39.330 |
| Thornton Creek mouth at Sand Point Way NE | 434 | 38.820 |
| Mill Creek near 68th Ave S and S 262nd St | A315 | 37.420 |
| Springbrook Creek mouth at SW 16th St | 317 | 20.260 |
The figure shows the worst WQI in terms of median values were reported at the stations: 317, A315, 434, and 322. All of them but 434 seem to be small creeks. On the contrary, only one of three the best WQIs, B319 seems to be measuring quality of a big river water, Green River; the map does not show any creeks for two of them.
It would be interesting to see how WQI changed over the years. Do changes depend on the year?
# how we name strata with respect to WQI
CONCERN_STRATA = {"l" : "Low concern", "m" : "Modern concern", "h" : "High concern" }
""" Stratifies the frame according to a median column into
three classes: [80; ...), [40;80), [0;40) and names
it according to CONCERN_STRATA
@param df_ (in): a dataframe that has a column named 'median'
@return The dataframe with a stratified column named WQI_class
and values h, m, l depending on the median
"""
def get_stratified_median_df(df_):
l_df = df_
bins = [0, 40, 80, np.inf]
# [0;40), [40;80), [80; ...)
names = [CONCERN_STRATA["h"],CONCERN_STRATA["m"], CONCERN_STRATA["l"]]
class_name = 'WQI_class'
# get the cut with left inclusive and right exclusive
l_df[class_name] = pd.cut(l_df['median'], bins, labels=names)
return l_df
#get_stratified_median_df(grp_c_stats_df)
"""Plots yearly WQI and allows to select the locators easier by
group selection and deselection
@param df_ the group that you want to work with, should have years and WQI
grouped by Locators as columns
@param stratified_df_ a supporting dataframe that one column that
classifies the locators per given metric such as median
into groups that can be selected or deselected.
@param title_ The title for the figure
@return the Dash app
"""
def plot_yearly_WQI(df_, stratified_df_, title_):
app = Dash(__name__)
# we don't need 'min' and 'max' column
stratified_df_ = stratified_df_.drop(['min', 'max'], axis=1)
# get the reverse colors, i.e., do for your plotly color sequence [::-1]
# and we don't want to make that color scale adjust every time within
# that palette; I always want to have bad WQIs in the same red and
# good WQIs in the same blue
stratified_df_['Colorscale'] = get_custom_colorscale(
px.colors.sequential.RdBu[::-1], len(stratified_df_.index))
#print(stratified_df_)
#print(df_.head())
app.layout = html.Div([
#html.H4("AnnualScore WQI for all locators in Kings County, WA,"
# f" with at least {C_GRP_COUNT} records available"
# f" for years {YEAR_MIN}-{YEAR_MAX}. The legend is sorted in a "
# "descending order according to stations' WQI median."),
dcc.Graph(id="graph"),
dcc.Checklist(
id="checklist",
options=[CONCERN_STRATA["l"], CONCERN_STRATA["m"], CONCERN_STRATA["h"]],
value=[CONCERN_STRATA["l"], CONCERN_STRATA["m"], CONCERN_STRATA["h"]],
inline=True,
),
])
@app.callback(
Output("graph", "figure"),
Input("checklist", "value"))
def update_line_chart(wqi):
# check which locators qualify for the choice
mask = stratified_df_.WQI_class.isin(wqi)
tmp_df = stratified_df_[mask]
qualified_locators = tmp_df['Locator'].to_list()
colorscale = tmp_df['Colorscale'].to_list()
f = px.line(df_, x = df_.index, y = qualified_locators,
labels = {"value" : "WQI",
"variable" : "Station ID"},
title = title_,
height = 800,
color_discrete_sequence=colorscale,
markers = True,
template = "plotly_dark"
)
f.update_yaxes(range = [-5, 100])
f.add_annotation(x=df_.index[0], y=40, text="High concern")
f.add_annotation(x=df_.index[0], y=80, text="Low concern")
return add_WQI_lines(f)
return app
plot_yearly_WQI(grp_c_df, get_stratified_median_df(grp_c_stats_df),
"AnnualScore WQI for all locators in Kings County, WA,"
f" with at least {C_GRP_COUNT} records"
f" for years {YEAR_MIN}-{YEAR_MAX}. The legend is sorted in a "
"descending order according to stations' WQI median. "
f"Total {len(grp_c_df.columns)} stations.").run_server(debug=True)
A few observations related to the historical values of WQI:
WQI $\approx$ 60 in 1993 and 1994, and more recently in 2007 where WQI = 67.
2022, $\Delta$ WQI $\approx$ [17, 30]. It might be because reporting for 2023 covers only part of year 2023. This locators' group seems to be on a recovery track. The worst times seem to be behind:
| Locators | Years when WQI was extremely low | $\cap$ (Hill) years |
|---|---|---|
| 322 | 1990-2002 | 2010-2017 |
| 434 | 1986-2007 | - |
| A315 | 1980-2002 | 2007-2015 |
| 317 | 1977-1992 | 1992-2002 |
All of those stations apart from 434 show an interesting hill, i.e., increase-decrease in the past; however, in different timeframes. The locators recorded a relatively high WQI (A315, WQI $\approx$ 96 in 2010; 312, WQI $\approx$ 78 in 2012) and then a big decrease in water quality.
occasionally going into both Low concern and High concern zone.
In order to understand how WQI has been shaped for the entire region, it would be interesting to compare how the median for the entire region was changing over the years.
"""get sorted columns according to ascending WQI for a list of locators
@param df_ (in): the frame that contains years and WQIs
@param loc_list_ (in): the list of locators we are interested in
@return dataframe with columns corresponding to WQIs and corresponding
years the WQI was reported, WQI sorted in ascending order
"""
def get_years_min_vals(df_, loc_list_):
min_df = pd.DataFrame()
for l in loc_list_:
df = df_[l]
df = df.sort_values().reset_index()
#print(df)
names = {"y" : f"{l}_year", "l" : f"{l}_WQI"}
min_df[names["l"]] = df[l]
min_df[names["y"]] = df['WaterYear']
return min_df
get_years_min_vals(grp_c_df, ['322', '434', 'A315', '317'])
| 322_WQI | 322_year | 434_WQI | 434_year | A315_WQI | A315_year | 317_WQI | 317_year | |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.49 | 1991 | 12.18 | 2007 | 1.00 | 1980 | 1.00 | 1990 |
| 1 | 9.89 | 1996 | 15.88 | 1986 | 2.91 | 1979 | 1.00 | 1977 |
| 2 | 12.66 | 1998 | 17.56 | 1992 | 4.45 | 1984 | 1.00 | 1979 |
| 3 | 15.14 | 1990 | 18.26 | 1999 | 8.54 | 1991 | 1.00 | 1983 |
| 4 | 19.87 | 2000 | 20.21 | 2004 | 8.63 | 1989 | 1.00 | 1984 |
| 5 | 20.08 | 1984 | 21.53 | 2006 | 9.70 | 1998 | 1.19 | 1985 |
| 6 | 23.69 | 2017 | 22.94 | 1976 | 9.79 | 1985 | 1.40 | 1988 |
| 7 | 24.90 | 2002 | 23.54 | 1997 | 10.34 | 1983 | 2.36 | 1986 |
| 8 | 26.19 | 1983 | 23.62 | 2008 | 10.46 | 1990 | 4.13 | 1992 |
| 9 | 26.91 | 1993 | 25.87 | 1998 | 11.87 | 1981 | 4.25 | 1987 |
| 10 | 28.42 | 1997 | 27.19 | 1981 | 13.81 | 1978 | 4.42 | 1980 |
| 11 | 28.90 | 1995 | 27.28 | 2003 | 14.04 | 2002 | 5.26 | 1989 |
| 12 | 30.35 | 2001 | 27.38 | 1996 | 15.01 | 1986 | 6.66 | 2004 |
| 13 | 31.25 | 1999 | 27.83 | 1993 | 15.27 | 1987 | 6.68 | 2002 |
| 14 | 33.65 | 1988 | 31.26 | 1995 | 19.47 | 1997 | 7.03 | 1991 |
| 15 | 34.12 | 1979 | 31.37 | 2005 | 21.70 | 1988 | 7.98 | 2006 |
| 16 | 34.13 | 2007 | 31.54 | 1980 | 22.82 | 1982 | 9.43 | 1982 |
| 17 | 34.45 | 2019 | 33.16 | 2015 | 23.61 | 1999 | 11.07 | 1981 |
| 18 | 35.40 | 2010 | 33.75 | 2000 | 25.77 | 2001 | 12.32 | 2001 |
| 19 | 36.14 | 1978 | 34.36 | 2010 | 27.04 | 1992 | 15.88 | 2000 |
| 20 | 36.26 | 1986 | 34.41 | 1979 | 32.05 | 2000 | 16.71 | 2008 |
| 21 | 36.43 | 1994 | 35.07 | 2002 | 32.50 | 2015 | 18.46 | 2003 |
| 22 | 38.58 | 2016 | 35.37 | 1988 | 34.59 | 2007 | 18.84 | 2007 |
| 23 | 38.80 | 2021 | 37.42 | 2009 | 37.42 | 2004 | 19.56 | 1998 |
| 24 | 39.33 | 2009 | 38.82 | 1994 | 39.72 | 1995 | 20.96 | 1999 |
| 25 | 39.43 | 2005 | 38.82 | 2014 | 42.14 | 2005 | 21.16 | 2005 |
| 26 | 39.80 | 1989 | 39.06 | 1990 | 43.89 | 2003 | 21.43 | 2009 |
| 27 | 39.87 | 2015 | 39.68 | 1984 | 44.04 | 1994 | 25.09 | 2020 |
| 28 | 41.30 | 2006 | 40.72 | 1975 | 44.78 | 2022 | 26.63 | 2013 |
| 29 | 41.40 | 1980 | 41.01 | 2018 | 46.00 | 1996 | 26.65 | 2021 |
| 30 | 41.69 | 1981 | 41.80 | 1974 | 46.79 | 2008 | 26.68 | 1997 |
| 31 | 41.81 | 1987 | 42.46 | 2001 | 46.79 | 2006 | 26.69 | 2015 |
| 32 | 42.53 | 2014 | 44.29 | 2019 | 47.12 | 2019 | 27.14 | 2019 |
| 33 | 42.63 | 1992 | 44.35 | 1989 | 47.69 | 2020 | 30.97 | 1978 |
| 34 | 45.08 | 2003 | 44.77 | 2012 | 50.31 | 2021 | 31.56 | 2022 |
| 35 | 49.82 | 2008 | 46.60 | 1991 | 53.55 | 1993 | 32.50 | 2017 |
| 36 | 50.03 | 2022 | 46.85 | 1987 | 54.48 | 2014 | 32.67 | 2010 |
| 37 | 50.43 | 1985 | 49.67 | 2011 | 55.22 | 2018 | 34.04 | 2012 |
| 38 | 51.03 | 2004 | 50.06 | 1983 | 56.99 | 2017 | 34.67 | 2014 |
| 39 | 52.30 | 2013 | 53.30 | 2013 | 61.20 | 2013 | 35.81 | 2018 |
| 40 | 52.61 | 1982 | 53.63 | 1982 | 64.49 | 2016 | 41.90 | 2011 |
| 41 | 55.03 | 2018 | 55.80 | 2017 | 66.20 | 1977 | 43.92 | 2016 |
| 42 | 60.90 | 1973 | 56.94 | 2020 | 72.40 | 2009 | 44.50 | 1993 |
| 43 | 63.99 | 2020 | 57.49 | 1985 | 78.97 | 2023 | 55.10 | 1995 |
| 44 | 65.50 | 2011 | 58.04 | 2016 | 86.87 | 2012 | 56.39 | 2023 |
| 45 | 68.80 | 1977 | 59.00 | 2021 | 93.01 | 2011 | 66.17 | 1996 |
| 46 | 77.38 | 1972 | 62.84 | 1977 | 96.14 | 2010 | 66.37 | 1994 |
| 47 | 78.31 | 2012 | 64.05 | 2022 | NaN | 1970 | 92.03 | 1973 |
| 48 | 80.71 | 2023 | 81.49 | 2023 | NaN | 1971 | NaN | 1970 |
| 49 | NaN | 1970 | 84.46 | 1972 | NaN | 1972 | NaN | 1971 |
| 50 | NaN | 1971 | NaN | 1970 | NaN | 1973 | NaN | 1972 |
| 51 | NaN | 1974 | NaN | 1971 | NaN | 1974 | NaN | 1974 |
| 52 | NaN | 1975 | NaN | 1973 | NaN | 1975 | NaN | 1975 |
| 53 | NaN | 1976 | NaN | 1978 | NaN | 1976 | NaN | 1976 |
First, we need to prepare the median dataframe.
# create a global median for the region for a given year
g_median_df = grp_c_df.median(axis=1).reset_index().rename(columns={"WaterYear" : "Year", 0 : 'Global_median'})
g_median_df.index.name = 'index'
g_median_df.sort_values("Global_median", ascending=False )
| Year | Global_median | |
|---|---|---|
| index | ||
| 2 | 1972 | 87.380 |
| 53 | 2023 | 84.145 |
| 42 | 2012 | 72.720 |
| 41 | 2011 | 72.430 |
| 50 | 2020 | 70.790 |
| 52 | 2022 | 68.340 |
| 7 | 1977 | 66.260 |
| 47 | 2017 | 65.920 |
| 3 | 1973 | 64.585 |
| 46 | 2016 | 64.490 |
| 48 | 2018 | 63.860 |
| 15 | 1985 | 63.370 |
| 24 | 1994 | 62.950 |
| 51 | 2021 | 62.890 |
| 18 | 1988 | 62.310 |
| 43 | 2013 | 62.305 |
| 23 | 1993 | 62.170 |
| 35 | 2005 | 60.830 |
| 36 | 2006 | 60.460 |
| 12 | 1982 | 59.950 |
| 49 | 2019 | 59.110 |
| 40 | 2010 | 58.980 |
| 19 | 1989 | 58.650 |
| 39 | 2009 | 58.390 |
| 8 | 1978 | 58.300 |
| 33 | 2003 | 58.240 |
| 45 | 2015 | 57.860 |
| 34 | 2004 | 57.370 |
| 29 | 1999 | 56.900 |
| 17 | 1987 | 55.780 |
| 22 | 1992 | 55.680 |
| 28 | 1998 | 54.710 |
| 27 | 1997 | 54.660 |
| 44 | 2014 | 54.480 |
| 0 | 1970 | 54.410 |
| 26 | 1996 | 53.610 |
| 21 | 1991 | 53.500 |
| 20 | 1990 | 52.760 |
| 31 | 2001 | 51.370 |
| 13 | 1983 | 50.060 |
| 25 | 1995 | 49.840 |
| 38 | 2008 | 49.820 |
| 9 | 1979 | 48.340 |
| 1 | 1971 | 48.310 |
| 37 | 2007 | 47.540 |
| 11 | 1981 | 46.730 |
| 32 | 2002 | 46.400 |
| 30 | 2000 | 45.360 |
| 14 | 1984 | 43.990 |
| 16 | 1986 | 42.900 |
| 10 | 1980 | 41.400 |
| 5 | 1975 | 39.815 |
| 4 | 1974 | 39.420 |
| 6 | 1976 | 38.240 |
And it might be useful to get the number of stations in a given year, although it is expected that the number of stations should not vary significantly. For majority of the examined period there were 23 stations reporting.
stations_count_yearly_grp_c_df = pd.DataFrame(grp_c_df.count(axis=1), columns = ["StationsCount"])
stations_count_yearly_grp_c_df
| StationsCount | |
|---|---|
| WaterYear | |
| 1970 | 2 |
| 1971 | 3 |
| 1972 | 19 |
| 1973 | 12 |
| 1974 | 9 |
| 1975 | 10 |
| 1976 | 15 |
| 1977 | 20 |
| 1978 | 12 |
| 1979 | 23 |
| 1980 | 23 |
| 1981 | 23 |
| 1982 | 23 |
| 1983 | 23 |
| 1984 | 23 |
| 1985 | 23 |
| 1986 | 23 |
| 1987 | 23 |
| 1988 | 23 |
| 1989 | 23 |
| 1990 | 23 |
| 1991 | 23 |
| 1992 | 23 |
| 1993 | 23 |
| 1994 | 23 |
| 1995 | 23 |
| 1996 | 23 |
| 1997 | 23 |
| 1998 | 23 |
| 1999 | 23 |
| 2000 | 23 |
| 2001 | 23 |
| 2002 | 23 |
| 2003 | 23 |
| 2004 | 23 |
| 2005 | 23 |
| 2006 | 23 |
| 2007 | 23 |
| 2008 | 23 |
| 2009 | 23 |
| 2010 | 19 |
| 2011 | 19 |
| 2012 | 19 |
| 2013 | 22 |
| 2014 | 21 |
| 2015 | 23 |
| 2016 | 23 |
| 2017 | 23 |
| 2018 | 23 |
| 2019 | 23 |
| 2020 | 23 |
| 2021 | 23 |
| 2022 | 22 |
| 2023 | 22 |
stations_count_yearly_grp_c_df.describe()
| StationsCount | |
|---|---|
| count | 54.000000 |
| mean | 20.740741 |
| std | 4.983691 |
| min | 2.000000 |
| 25% | 22.000000 |
| 50% | 23.000000 |
| 75% | 23.000000 |
| max | 23.000000 |
""" Plots the median per a given year, also plots the number of stations per
given year.
@param grp_df_ (in) The group of stations we want to consider
@param stations_count_df_ (in) The dataframe with he stations count
@param rec_count (in) the number of records in that group avail.
@return a plot with yearly median for the region represented by grp_df_
"""
def plot_grp_median_horizontally(grp_df_, stations_count_df_, rec_count):
# merge the dataframe with wqi values into a dataframe having
# the location, stations names and stations' ids so we can
# nicely plot the clean locations
stations_count_df_ = stations_count_df_.reset_index().rename(columns={"WaterYear" : "Year"})
df = grp_df_.merge(stations_count_df_)
f = px.line(df, x="Year", y=["Global_median"], markers=True,
title=f"The yearly WQI median of all stations with at least "
f"with at least {rec_count} records available "
f"in King County, WA, and the number of stations for each year "
f"over years {YEAR_MIN}-{YEAR_MAX}.",
height = 900,
labels = { "value" : "WQI", "variable" : ""})
f.update_layout(height = 900)
f.update_xaxes(dtick=2)
f.update_yaxes(dtick=10)
f.add_annotation(x='1970', y=40, text="High concern")
f.add_annotation(x='1970', y=80, text="Low concern")
f.add_trace(go.Bar(x=df["Year"], y = df["StationsCount"],
name="Stations Count"))
f.update_yaxes(range = [-5, 100])
return add_WQI_lines(f)
#plot_grp_median_horizontally(g_median_df, stations_count_yearly_grp_c_df, C_GRP_COUNT).show()
If we look at the median for the entire region year by year, the quality of water remains within the Modern concern. Year 2023 is incomplete and as such should be excluded from the analysis or analyzed separately. Now, since we know a little bit more about the entire history of the region, let's take a look at the last ten years, excluding 2023 year. For that, we need to prepare a different dataframe.
"""Plots a box plot with a line to connect medians across the year for the
region defined by grp_df_
@param grp_df_ (in) The group of stations that we are interested in
@param rec_count (in) The number of records guaranteed to be per
each station
@param stations_count_df_ (in) The dataframe with the stations count
@param period_ (in) string describing the years in the title
@return fig The figure with medians across a given year with a line
connecting medians for each year.
"""
def plot_yearly_medians(grp_df_, rec_count, stations_count_df_, period_):
# transpose the diagram
df = grp_df_.transpose()
median_series = df.median(axis = 0)
stats_count_df = stations_count_df_.reset_index().rename(columns={"WaterYear" : "Year"})
#print(median_series.to_list())
#print(median_series.iloc[:,0])
#print(median_series)
#print(grp_df_.head())
#print(df.head())
fig = px.box(df, #points='all',
title=f"Yearly stats for King County, WA, including stations "
f"with at least {rec_count}"
f" records for a period {period_}.",
labels = { "value" : "WQI",
"variable" : "Stations IDs (Locators)"},
height = 900)
fig.add_hline(y=40, line_dash="dash", line_color="red")
fig.add_annotation(x=0, y=40, text="High concern")
fig.add_hline(y=80, line_dash="dash", line_color="green")
fig.add_annotation(x=0, y=80, text="Low concern", ay=50)
fig.update_xaxes(tickangle = 45)
fig.update_layout(yaxis_title="WQI / Stations Count")
fig.add_trace(go.Scatter(x = median_series.index, y = median_series.to_list(),
name="Medians' line"))
fig.add_trace(go.Bar(x=stats_count_df["Year"], y = stats_count_df["StationsCount"],
name="Stations Count"))
return fig
plot_yearly_medians(grp_c_df, C_GRP_COUNT, stations_count_yearly_grp_c_df,
f"{YEAR_MIN}-{YEAR_MAX}").show()
# the ten years' period that I want to examine
DELTA_YEARS = { 'start' : 2013, 'end' : 2022}
First, let's look at the dataframe. I removed the 2023 year as incomplete and all years prior to 2013.
# the dataframe for the years I am interested in
df10 = df_an.loc[df_an.index.isin(range(DELTA_YEARS['start'], 1 + DELTA_YEARS['end']))]
print(df10.shape)
df10.info()
df10
(10, 78) <class 'pandas.core.frame.DataFrame'> Index: 10 entries, 2013 to 2022 Data columns (total 78 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 0450CC 10 non-null float64 1 0484A 2 non-null float64 2 3106 10 non-null float64 3 311 8 non-null float64 4 317 10 non-null float64 5 321 10 non-null float64 6 322 10 non-null float64 7 430 10 non-null float64 8 434 10 non-null float64 9 438 10 non-null float64 10 440 10 non-null float64 11 442 10 non-null float64 12 444 10 non-null float64 13 446 10 non-null float64 14 470 10 non-null float64 15 474 10 non-null float64 16 478 10 non-null float64 17 484 9 non-null float64 18 486 10 non-null float64 19 631 10 non-null float64 20 632 9 non-null float64 21 A315 10 non-null float64 22 A319 8 non-null float64 23 A320 10 non-null float64 24 A432 10 non-null float64 25 A438 8 non-null float64 26 A456 10 non-null float64 27 A499 8 non-null float64 28 A617 10 non-null float64 29 A620 10 non-null float64 30 A631 10 non-null float64 31 A670 8 non-null float64 32 A680 10 non-null float64 33 A685 10 non-null float64 34 A687 4 non-null float64 35 A690 8 non-null float64 36 AMES_1 10 non-null float64 37 B319 10 non-null float64 38 B484 10 non-null float64 39 B499 6 non-null float64 40 BB470 8 non-null float64 41 BSE_1MUDMTNRD 10 non-null float64 42 C320 10 non-null float64 43 C370 10 non-null float64 44 C446 0 non-null float64 45 C484 10 non-null float64 46 CHERRY_1 10 non-null float64 47 D320 10 non-null float64 48 D474 8 non-null float64 49 F321 10 non-null float64 50 G320 10 non-null float64 51 GRIFFIN 10 non-null float64 52 HARRIS_1 10 non-null float64 53 J484 2 non-null float64 54 KSHZ06 10 non-null float64 55 KTHA01 10 non-null float64 56 KTHA02 8 non-null float64 57 KTHA03 10 non-null float64 58 LSIN1 10 non-null float64 59 LSIN9 10 non-null float64 60 MFK_SNQ 10 non-null float64 61 N484 10 non-null float64 62 NFK_SNQ 10 non-null float64 63 PATTER_3 10 non-null float64 64 RAGING_MTH 10 non-null float64 65 S478 8 non-null float64 66 S484 8 non-null float64 67 SFK_SNQ 10 non-null float64 68 SKYKOMISH 10 non-null float64 69 SNQDUVALL 10 non-null float64 70 TOLT_MTH 10 non-null float64 71 VA12A 9 non-null float64 72 VA37A 9 non-null float64 73 VA41A 9 non-null float64 74 VA42A 10 non-null float64 75 VA45A 9 non-null float64 76 VA65A 10 non-null float64 77 X630 10 non-null float64 dtypes: float64(78) memory usage: 6.2 KB
| 0450CC | 0484A | 3106 | 311 | 317 | 321 | 322 | 430 | 434 | 438 | 440 | 442 | 444 | 446 | 470 | 474 | 478 | 484 | 486 | 631 | 632 | A315 | A319 | A320 | A432 | A438 | A456 | A499 | A617 | A620 | A631 | A670 | A680 | A685 | A687 | A690 | AMES_1 | B319 | B484 | B499 | BB470 | BSE_1MUDMTNRD | C320 | C370 | C446 | C484 | CHERRY_1 | D320 | D474 | F321 | G320 | GRIFFIN | HARRIS_1 | J484 | KSHZ06 | KTHA01 | KTHA02 | KTHA03 | LSIN1 | LSIN9 | MFK_SNQ | N484 | NFK_SNQ | PATTER_3 | RAGING_MTH | S478 | S484 | SFK_SNQ | SKYKOMISH | SNQDUVALL | TOLT_MTH | VA12A | VA37A | VA41A | VA42A | VA45A | VA65A | X630 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WaterYear | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2013 | 39.22 | NaN | 63.99 | NaN | 26.63 | 86.67 | 52.30 | 63.62 | 53.30 | 82.60 | 80.50 | 77.87 | 36.54 | 61.18 | 45.94 | 25.88 | 69.49 | 59.66 | 46.11 | 73.94 | 63.41 | 61.20 | NaN | 87.70 | 75.60 | NaN | 50.96 | NaN | 76.87 | 69.29 | 80.26 | NaN | 67.15 | 91.89 | NaN | NaN | 52.88 | 88.90 | 27.57 | NaN | NaN | 59.33 | 78.24 | 52.09 | NaN | 63.53 | 86.85 | 90.55 | NaN | 93.28 | 65.76 | 85.29 | 91.59 | NaN | 52.88 | 54.33 | NaN | 83.13 | 76.68 | 99.93 | 77.62 | 69.23 | 74.85 | 66.55 | 81.25 | NaN | NaN | 62.20 | 87.91 | 78.81 | 91.50 | NaN | NaN | NaN | 77.55 | NaN | 72.66 | 53.42 |
| 2014 | 40.20 | NaN | 66.95 | NaN | 34.67 | 56.92 | 42.53 | 41.95 | 38.82 | 83.19 | 64.23 | 47.38 | 21.33 | 56.70 | 40.08 | 27.83 | 54.81 | 41.31 | 44.51 | 53.75 | NaN | 54.48 | NaN | 82.39 | 53.95 | NaN | 35.01 | NaN | 55.55 | 71.47 | 52.59 | NaN | 54.71 | 76.32 | NaN | NaN | 46.09 | 84.33 | 35.54 | NaN | NaN | 75.28 | 69.95 | 39.42 | NaN | 44.48 | 87.21 | 84.28 | NaN | 91.13 | 59.50 | 85.18 | 82.95 | NaN | 44.29 | 54.51 | NaN | 61.60 | 86.65 | 99.37 | 81.20 | 62.31 | 85.19 | 61.82 | 67.26 | NaN | NaN | 66.86 | 87.73 | 88.20 | 85.16 | 89.46 | 77.17 | 60.44 | 49.14 | 30.41 | 70.40 | 41.84 |
| 2015 | 35.69 | NaN | 62.42 | 62.91 | 26.69 | 62.54 | 39.87 | 55.76 | 33.16 | 78.78 | 76.42 | 70.58 | 41.08 | 61.86 | 57.45 | 34.86 | 45.59 | 48.61 | 57.86 | 65.66 | 44.39 | 32.50 | 81.03 | 84.30 | 44.40 | 83.06 | 35.49 | 53.69 | 69.94 | 69.77 | 76.10 | 72.49 | 72.48 | 89.12 | NaN | 81.36 | 43.03 | 84.78 | 32.04 | NaN | 64.15 | 71.82 | 75.59 | 48.77 | NaN | 46.97 | 83.52 | 83.87 | 46.83 | 95.05 | 60.68 | 86.93 | 88.20 | 74.45 | 48.49 | 76.30 | 39.59 | 63.21 | 11.25 | 83.47 | 67.61 | 64.06 | 82.79 | 69.68 | 64.93 | 69.52 | 37.72 | 60.61 | 84.26 | 82.32 | 79.20 | 74.61 | 72.76 | 66.32 | 58.33 | 33.95 | 65.70 | 39.65 |
| 2016 | 41.45 | NaN | 58.55 | 56.27 | 43.92 | 62.32 | 38.58 | 67.18 | 58.04 | 66.29 | 51.59 | 57.33 | 44.48 | 57.24 | 67.98 | 40.83 | 68.68 | 60.42 | 62.45 | 69.11 | 70.30 | 64.49 | 60.31 | 88.81 | 71.11 | 90.54 | 29.52 | 55.72 | 53.50 | 63.43 | 71.81 | 75.00 | 77.57 | 77.44 | NaN | 81.62 | 44.29 | 78.88 | 49.31 | NaN | 62.31 | 51.06 | 90.49 | 49.94 | NaN | 65.86 | 91.11 | 86.31 | 68.32 | 69.50 | 70.92 | 91.47 | 90.21 | 80.92 | 36.44 | 72.16 | 28.15 | 72.96 | 30.99 | 87.16 | 80.06 | 77.89 | 86.43 | 53.48 | 89.79 | 71.72 | 41.02 | 77.65 | 89.41 | 54.18 | 89.02 | 50.54 | 56.33 | 42.18 | 52.45 | 35.15 | 38.81 | 46.38 |
| 2017 | 55.91 | NaN | 72.43 | 70.98 | 32.50 | 60.13 | 23.69 | 59.61 | 55.80 | 87.90 | 61.54 | 63.31 | 58.70 | 65.92 | 69.09 | 45.72 | 57.44 | 65.70 | 57.45 | 66.96 | 66.02 | 56.99 | 79.89 | 76.91 | 62.43 | 87.79 | 66.75 | 83.99 | 74.31 | 63.58 | 70.91 | 78.77 | 73.66 | 85.07 | NaN | 78.24 | 61.31 | 84.69 | 52.20 | 75.10 | 74.16 | 70.07 | 86.61 | 49.92 | NaN | 66.18 | 90.45 | 91.62 | 65.22 | 94.51 | 75.22 | 92.46 | 92.96 | NaN | 44.93 | 81.92 | 47.96 | 57.36 | 22.72 | 91.96 | 83.81 | 76.95 | 86.28 | 66.57 | 79.96 | 65.48 | 47.36 | 89.38 | 92.05 | 77.50 | 86.98 | 63.63 | 53.97 | 59.29 | 37.49 | 44.69 | 58.77 | 46.96 |
| 2018 | 65.77 | NaN | 66.96 | 65.76 | 35.81 | 56.57 | 55.03 | 63.86 | 41.01 | 83.55 | 75.81 | 82.72 | 55.80 | 66.66 | 57.48 | 41.66 | 61.01 | 72.83 | 63.59 | 65.34 | 37.08 | 55.22 | 86.64 | 85.58 | 69.32 | 90.58 | 62.35 | 82.97 | 63.51 | 63.57 | 77.02 | 47.63 | 79.17 | 84.95 | NaN | 72.42 | 60.09 | 90.77 | 48.27 | 62.88 | 74.22 | 72.38 | 78.97 | 43.72 | NaN | 76.73 | 81.73 | 88.56 | 62.01 | 95.76 | 75.05 | 89.17 | 87.79 | NaN | 63.44 | 62.63 | 63.31 | 73.95 | 18.42 | 85.47 | 78.11 | 80.05 | 82.36 | 56.10 | 78.62 | 84.57 | 42.41 | 83.57 | 92.81 | 69.00 | 87.73 | 78.33 | 78.85 | 70.18 | 70.18 | 24.11 | 72.29 | 47.06 |
| 2019 | 32.79 | NaN | 68.22 | 68.28 | 27.14 | 57.94 | 34.45 | 44.58 | 44.29 | 85.04 | 66.73 | 80.48 | 42.50 | 60.01 | 53.55 | 47.51 | 59.96 | 55.95 | 58.49 | 59.11 | 57.12 | 47.12 | 75.70 | 83.25 | 53.73 | 85.55 | 34.82 | 75.86 | 64.82 | 62.00 | 63.04 | 80.33 | 78.24 | 90.22 | 72.67 | 85.45 | 76.14 | 87.38 | 40.03 | 49.72 | 72.64 | 70.02 | 77.73 | 56.23 | NaN | 58.63 | 90.29 | 85.14 | 48.17 | 95.72 | 66.13 | 92.85 | 92.49 | NaN | 65.34 | 66.78 | 66.21 | 81.09 | 28.10 | 84.88 | 84.68 | 66.95 | 91.10 | 77.12 | 80.16 | 72.53 | 32.57 | 83.36 | 89.99 | 83.94 | 90.98 | 81.47 | 69.81 | 76.89 | 74.91 | 34.70 | 73.79 | 54.34 |
| 2020 | 62.50 | NaN | 72.85 | 75.55 | 25.09 | 62.46 | 63.99 | 69.91 | 56.94 | 87.70 | 81.81 | 86.24 | 54.35 | 67.64 | 69.74 | 47.49 | 81.21 | 81.00 | 70.76 | 79.11 | 70.79 | 47.69 | 79.32 | 84.73 | 67.70 | 87.20 | 67.86 | 82.47 | 85.30 | 80.76 | 77.06 | 77.11 | 82.31 | 87.09 | 81.00 | 78.23 | 53.86 | 87.42 | 53.80 | 67.06 | 87.20 | 74.76 | 83.60 | 52.60 | NaN | 84.58 | 81.07 | 89.10 | 70.43 | 95.44 | 79.61 | 86.00 | 80.94 | NaN | 49.97 | 56.44 | 59.53 | 80.00 | 18.62 | 91.60 | 67.49 | 85.02 | 79.68 | 63.13 | 58.93 | 83.63 | 37.92 | 57.46 | 92.90 | 64.77 | 75.65 | 86.31 | 69.56 | 78.02 | 76.12 | 35.61 | 77.56 | 72.79 |
| 2021 | 43.73 | 72.12 | 66.44 | 62.89 | 26.65 | 60.60 | 38.80 | 68.43 | 59.00 | 72.80 | 67.06 | 56.78 | 41.14 | 61.53 | 61.24 | 48.14 | 67.38 | 71.62 | 47.22 | 72.85 | 49.49 | 50.31 | 71.23 | 71.27 | 66.62 | 83.47 | 65.00 | 78.63 | 58.52 | 49.28 | 73.87 | 62.38 | 67.01 | 75.32 | 67.48 | 86.11 | 51.32 | 83.81 | 27.80 | 66.01 | 76.17 | 76.31 | 78.19 | 48.91 | NaN | 66.95 | 79.36 | 77.65 | 66.21 | 91.92 | 67.12 | 84.17 | 91.02 | NaN | 55.71 | 38.36 | 30.38 | 64.13 | 7.13 | 91.20 | 73.77 | 72.56 | 85.72 | 69.06 | 63.23 | 82.41 | 35.08 | 65.47 | 88.69 | 75.77 | 81.06 | 67.88 | 29.74 | 33.14 | 59.38 | 19.78 | 52.87 | 42.49 |
| 2022 | 48.03 | 69.78 | 72.72 | 70.42 | 31.56 | 54.33 | 50.03 | 78.68 | 64.05 | 87.05 | 79.95 | 69.99 | 41.24 | 71.33 | 64.29 | 52.62 | 66.26 | NaN | 54.30 | 76.68 | 33.41 | 44.78 | 76.63 | 75.39 | 75.22 | 90.72 | 63.93 | 84.70 | 67.23 | 71.99 | 80.58 | 69.16 | 76.84 | 73.12 | 79.96 | 84.24 | 52.75 | 93.22 | 51.97 | 67.51 | 77.87 | 55.98 | 79.39 | 60.28 | NaN | 73.67 | 87.18 | 88.20 | 63.49 | 95.60 | 70.42 | 92.50 | 84.69 | NaN | 64.40 | 57.99 | 60.66 | 77.25 | 18.52 | 90.99 | 84.86 | 74.37 | 86.65 | 64.31 | 87.36 | 81.90 | 38.37 | 80.05 | 94.14 | 87.25 | 94.85 | 80.86 | 68.57 | 64.91 | 75.73 | 52.90 | 55.20 | 52.20 |
Below, there are stations in the decreasing order of the count of available records for the given time period.
# frequencies of the number of records per station for the given 10-year
# period; this is a series
series10_rec_freq = get_and_sort_freqs(df10)
series10_rec_freq
0450CC 10 A680 10 AMES_1 10 B319 10 B484 10 BSE_1MUDMTNRD 10 C320 10 C370 10 C484 10 CHERRY_1 10 D320 10 F321 10 G320 10 GRIFFIN 10 HARRIS_1 10 KSHZ06 10 KTHA01 10 KTHA03 10 LSIN1 10 LSIN9 10 MFK_SNQ 10 N484 10 NFK_SNQ 10 PATTER_3 10 RAGING_MTH 10 SFK_SNQ 10 SKYKOMISH 10 SNQDUVALL 10 TOLT_MTH 10 VA42A 10 VA65A 10 A685 10 X630 10 478 10 317 10 438 10 440 10 442 10 444 10 446 10 470 10 474 10 430 10 322 10 486 10 631 10 321 10 A315 10 A320 10 A456 10 3106 10 A631 10 A620 10 A617 10 434 10 A432 10 484 9 VA41A 9 VA45A 9 VA37A 9 VA12A 9 632 9 S478 8 311 8 S484 8 A670 8 KTHA02 8 D474 8 A319 8 A438 8 A499 8 BB470 8 A690 8 B499 6 A687 4 J484 2 0484A 2 C446 0 dtype: int64
# show the histogram of the record availability frequencies
plot_freqs(series10_rec_freq,
f"Frequency of available records for timeframe "
f"{DELTA_YEARS['start']}-{DELTA_YEARS['end']} "
f"available at a locator.").show()
Let's see where the stations that have 8, 9, and 10 records available are located. It will help decide whether we should only focus on locators with 10 records available.
# remove the stations that with frequencies less than 8 available records
series10_rec_freq = series10_rec_freq[series10_rec_freq >= 8]
""" Change the series to the dataframe
@param s_ (in): The series to be changed
@param col_name_ (in) : the name what the series represent
@param reset_idx_ (in): True the index will be reset, otherwise not
@param idx_name_ (in) : how the index will be named, only matters if
reset_idx is set to True
@return df : the dataframe corresponding to a s_ with index as a first
column and series as the second column
"""
def series2df(s_, col_name_, idx_name_ = 'index', reset_idx_ = True ):
if (reset_idx_):
# so your index starts with 0, 1, 2,
df = s_.to_frame(name=col_name_ ).reset_index()
df = df.rename(columns= {'index' : idx_name_})
# otherwise the name of the index is None
df.index.name = 'index'
else:
# just make it a frame
df = s_.to_frame(name=col_name_)
return df
""" Prepares the record frequency series for drawing and studying
@param s_ (in) series to be changed, i.e., records count per station
@param loc_df_ (in) The dataframe with station locations that will
be merged to the resultant location
@return df The combined dataframe that has stations' data and records
counting
"""
def prepare_rec_freq_series(s_, loc_df_):
# convert the series to a dataframe with proper column names
# the name of the default column
COL1 = 'Rec_count'
rec_count_df = series2df(s_, COL1, 'Locator')
# stratify the the dataframe by the Rec_count column
bins = [0, 8, 9, 10]
labels = ['8', '9', '10']
rec_count_df['Rec_count_lbl'] = pd.cut(rec_count_df[COL1], bins = bins, labels=labels)
#print(rec_count_df)
# merge the dataframe with wqi values into a dataframe having
# the location, stations names and stations' ids so we can
# nicely plot the clean locations
# print(loc_df_.columns)
df = rec_count_df.merge(loc_df_[['Locator', 'SiteName', 'lng', 'lat']], on = 'Locator')
#print(df)
return df
# 10 last years records count per stations
rec_freq_10_df = prepare_rec_freq_series(series10_rec_freq, loc_df)
"""
Prepares the figure to figure out where the stations with
a given number of records are
@param rec_freq_df_ (in) main dataframe to be presented
@param title_ (in) The title of the figure
@return figure
"""
def plot_density_scatter(df_, title_=""):
#print(df_)
f = px.scatter_mapbox(df_, lat = "lat", lon = "lng",
center = dict (lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG+0.2), zoom = 8,
hover_name = "SiteName", hover_data = ["Locator", "Rec_count"],
mapbox_style = MAP_STYLE, title = title_,
size = 'Rec_count',
size_max = 20,
color = 'Rec_count_lbl',
#labels = { "legend" : {"-1" : "<40", "0" : "[40; 80)", "1" : ">=80"} },
#legend = {"-1" : "<40", "0" : "[40; 80)", "1" : ">=80"},
color_discrete_map = {
'8' : "red",
'9' : "yellow",
'10' : "blue"
},
category_orders = {
'Rec_count_lbl' : [ '10', '9', '8']
},
width = 800, height = 900)
#f.update_layout(legend={ labels : {"-1" : "<40", "0" : "[40; 80)", "1" : ">=80"}})
f.update_layout(legend_title_text="Record count")
return f
plot_density_scatter(rec_freq_10_df,
"Stations in King County, WA, grouped by the number of "
f"records available for the period "
f"{DELTA_YEARS['start']}-{DELTA_YEARS['end']}.").show()
The coverage of the region by stations holding 10 records if pretty comprehensive. The stations with 9 records are mostly on Vashon Island (4 records). Two other stations are located in Redmond and Issaquah. The 11 stations holding 8 records to some extent overlap area with stations holding 10 records. There are three northern locations that are not covered by any other stations (North Creek), one in Bellevue (Cochran Springs), and one at the East Renton area (Cedar River).
As we can see below, the missing years are mostly 2013 and 2014, apart from one station 484 for which only the record from 2022 is unavailable. So in order to have a longer timeframe and a decent region coverage I will use only stations that offer 10 records for the last 10 years. Another option would be to exclude 2013 and 2014 years from the analysis and focus on the period 2015-2022.
# check which years are missing
locs = rec_freq_10_df[rec_freq_10_df['Rec_count'] < 10]['Locator']
get_missing_years(dat_fr=df10, loc_list = locs.to_list())
[['484', [2022]], ['VA41A', [2013]], ['VA45A', [2013]], ['VA37A', [2013]], ['VA12A', [2013]], ['632', [2014]], ['S478', [2013, 2014]], ['311', [2013, 2014]], ['S484', [2013, 2014]], ['A670', [2013, 2014]], ['KTHA02', [2013, 2014]], ['D474', [2013, 2014]], ['A319', [2013, 2014]], ['A438', [2013, 2014]], ['A499', [2013, 2014]], ['BB470', [2013, 2014]], ['A690', [2013, 2014]]]
# that is a dataframe with all 10 records available
only10_df = df10.loc[:, df10.columns.isin (series10_rec_freq[series10_rec_freq == 10].index)]
print(only10_df)
print(only10_df.info())
print(only10_df.describe())
0450CC 3106 317 321 322 430 434 438 440 \
WaterYear
2013 39.22 63.99 26.63 86.67 52.30 63.62 53.30 82.60 80.50
2014 40.20 66.95 34.67 56.92 42.53 41.95 38.82 83.19 64.23
2015 35.69 62.42 26.69 62.54 39.87 55.76 33.16 78.78 76.42
2016 41.45 58.55 43.92 62.32 38.58 67.18 58.04 66.29 51.59
2017 55.91 72.43 32.50 60.13 23.69 59.61 55.80 87.90 61.54
2018 65.77 66.96 35.81 56.57 55.03 63.86 41.01 83.55 75.81
2019 32.79 68.22 27.14 57.94 34.45 44.58 44.29 85.04 66.73
2020 62.50 72.85 25.09 62.46 63.99 69.91 56.94 87.70 81.81
2021 43.73 66.44 26.65 60.60 38.80 68.43 59.00 72.80 67.06
2022 48.03 72.72 31.56 54.33 50.03 78.68 64.05 87.05 79.95
442 444 446 470 474 478 486 631 A315 \
WaterYear
2013 77.87 36.54 61.18 45.94 25.88 69.49 46.11 73.94 61.20
2014 47.38 21.33 56.70 40.08 27.83 54.81 44.51 53.75 54.48
2015 70.58 41.08 61.86 57.45 34.86 45.59 57.86 65.66 32.50
2016 57.33 44.48 57.24 67.98 40.83 68.68 62.45 69.11 64.49
2017 63.31 58.70 65.92 69.09 45.72 57.44 57.45 66.96 56.99
2018 82.72 55.80 66.66 57.48 41.66 61.01 63.59 65.34 55.22
2019 80.48 42.50 60.01 53.55 47.51 59.96 58.49 59.11 47.12
2020 86.24 54.35 67.64 69.74 47.49 81.21 70.76 79.11 47.69
2021 56.78 41.14 61.53 61.24 48.14 67.38 47.22 72.85 50.31
2022 69.99 41.24 71.33 64.29 52.62 66.26 54.30 76.68 44.78
A320 A432 A456 A617 A620 A631 A680 A685 AMES_1 \
WaterYear
2013 87.70 75.60 50.96 76.87 69.29 80.26 67.15 91.89 52.88
2014 82.39 53.95 35.01 55.55 71.47 52.59 54.71 76.32 46.09
2015 84.30 44.40 35.49 69.94 69.77 76.10 72.48 89.12 43.03
2016 88.81 71.11 29.52 53.50 63.43 71.81 77.57 77.44 44.29
2017 76.91 62.43 66.75 74.31 63.58 70.91 73.66 85.07 61.31
2018 85.58 69.32 62.35 63.51 63.57 77.02 79.17 84.95 60.09
2019 83.25 53.73 34.82 64.82 62.00 63.04 78.24 90.22 76.14
2020 84.73 67.70 67.86 85.30 80.76 77.06 82.31 87.09 53.86
2021 71.27 66.62 65.00 58.52 49.28 73.87 67.01 75.32 51.32
2022 75.39 75.22 63.93 67.23 71.99 80.58 76.84 73.12 52.75
B319 B484 BSE_1MUDMTNRD C320 C370 C484 CHERRY_1 D320 \
WaterYear
2013 88.90 27.57 59.33 78.24 52.09 63.53 86.85 90.55
2014 84.33 35.54 75.28 69.95 39.42 44.48 87.21 84.28
2015 84.78 32.04 71.82 75.59 48.77 46.97 83.52 83.87
2016 78.88 49.31 51.06 90.49 49.94 65.86 91.11 86.31
2017 84.69 52.20 70.07 86.61 49.92 66.18 90.45 91.62
2018 90.77 48.27 72.38 78.97 43.72 76.73 81.73 88.56
2019 87.38 40.03 70.02 77.73 56.23 58.63 90.29 85.14
2020 87.42 53.80 74.76 83.60 52.60 84.58 81.07 89.10
2021 83.81 27.80 76.31 78.19 48.91 66.95 79.36 77.65
2022 93.22 51.97 55.98 79.39 60.28 73.67 87.18 88.20
F321 G320 GRIFFIN HARRIS_1 KSHZ06 KTHA01 KTHA03 LSIN1 \
WaterYear
2013 93.28 65.76 85.29 91.59 52.88 54.33 83.13 76.68
2014 91.13 59.50 85.18 82.95 44.29 54.51 61.60 86.65
2015 95.05 60.68 86.93 88.20 48.49 76.30 63.21 11.25
2016 69.50 70.92 91.47 90.21 36.44 72.16 72.96 30.99
2017 94.51 75.22 92.46 92.96 44.93 81.92 57.36 22.72
2018 95.76 75.05 89.17 87.79 63.44 62.63 73.95 18.42
2019 95.72 66.13 92.85 92.49 65.34 66.78 81.09 28.10
2020 95.44 79.61 86.00 80.94 49.97 56.44 80.00 18.62
2021 91.92 67.12 84.17 91.02 55.71 38.36 64.13 7.13
2022 95.60 70.42 92.50 84.69 64.40 57.99 77.25 18.52
LSIN9 MFK_SNQ N484 NFK_SNQ PATTER_3 RAGING_MTH SFK_SNQ \
WaterYear
2013 99.93 77.62 69.23 74.85 66.55 81.25 62.20
2014 99.37 81.20 62.31 85.19 61.82 67.26 66.86
2015 83.47 67.61 64.06 82.79 69.68 64.93 60.61
2016 87.16 80.06 77.89 86.43 53.48 89.79 77.65
2017 91.96 83.81 76.95 86.28 66.57 79.96 89.38
2018 85.47 78.11 80.05 82.36 56.10 78.62 83.57
2019 84.88 84.68 66.95 91.10 77.12 80.16 83.36
2020 91.60 67.49 85.02 79.68 63.13 58.93 57.46
2021 91.20 73.77 72.56 85.72 69.06 63.23 65.47
2022 90.99 84.86 74.37 86.65 64.31 87.36 80.05
SKYKOMISH SNQDUVALL TOLT_MTH VA42A VA65A X630
WaterYear
2013 87.91 78.81 91.50 77.55 72.66 53.42
2014 87.73 88.20 85.16 49.14 70.40 41.84
2015 84.26 82.32 79.20 58.33 65.70 39.65
2016 89.41 54.18 89.02 52.45 38.81 46.38
2017 92.05 77.50 86.98 37.49 58.77 46.96
2018 92.81 69.00 87.73 70.18 72.29 47.06
2019 89.99 83.94 90.98 74.91 73.79 54.34
2020 92.90 64.77 75.65 76.12 77.56 72.79
2021 88.69 75.77 81.06 59.38 52.87 42.49
2022 94.14 87.25 94.85 75.73 55.20 52.20
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 2013 to 2022
Data columns (total 56 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0450CC 10 non-null float64
1 3106 10 non-null float64
2 317 10 non-null float64
3 321 10 non-null float64
4 322 10 non-null float64
5 430 10 non-null float64
6 434 10 non-null float64
7 438 10 non-null float64
8 440 10 non-null float64
9 442 10 non-null float64
10 444 10 non-null float64
11 446 10 non-null float64
12 470 10 non-null float64
13 474 10 non-null float64
14 478 10 non-null float64
15 486 10 non-null float64
16 631 10 non-null float64
17 A315 10 non-null float64
18 A320 10 non-null float64
19 A432 10 non-null float64
20 A456 10 non-null float64
21 A617 10 non-null float64
22 A620 10 non-null float64
23 A631 10 non-null float64
24 A680 10 non-null float64
25 A685 10 non-null float64
26 AMES_1 10 non-null float64
27 B319 10 non-null float64
28 B484 10 non-null float64
29 BSE_1MUDMTNRD 10 non-null float64
30 C320 10 non-null float64
31 C370 10 non-null float64
32 C484 10 non-null float64
33 CHERRY_1 10 non-null float64
34 D320 10 non-null float64
35 F321 10 non-null float64
36 G320 10 non-null float64
37 GRIFFIN 10 non-null float64
38 HARRIS_1 10 non-null float64
39 KSHZ06 10 non-null float64
40 KTHA01 10 non-null float64
41 KTHA03 10 non-null float64
42 LSIN1 10 non-null float64
43 LSIN9 10 non-null float64
44 MFK_SNQ 10 non-null float64
45 N484 10 non-null float64
46 NFK_SNQ 10 non-null float64
47 PATTER_3 10 non-null float64
48 RAGING_MTH 10 non-null float64
49 SFK_SNQ 10 non-null float64
50 SKYKOMISH 10 non-null float64
51 SNQDUVALL 10 non-null float64
52 TOLT_MTH 10 non-null float64
53 VA42A 10 non-null float64
54 VA65A 10 non-null float64
55 X630 10 non-null float64
dtypes: float64(56)
memory usage: 4.5 KB
None
0450CC 3106 317 321 322 430 \
count 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000
mean 46.529000 67.153000 31.066000 62.048000 43.927000 61.358000
std 11.298463 4.707403 5.892317 9.099769 11.574836 11.358006
min 32.790000 58.550000 25.090000 54.330000 23.690000 41.950000
25% 39.465000 64.602500 26.660000 57.175000 38.635000 56.722500
50% 42.590000 66.955000 29.350000 60.365000 41.200000 63.740000
75% 53.940000 71.377500 34.127500 62.425000 51.732500 68.117500
max 65.770000 72.850000 43.920000 86.670000 63.990000 78.680000
434 438 440 442 444 446 \
count 10.000000 10.000000 10.000000 10.000000 10.00000 10.000000
mean 50.441000 81.490000 70.564000 69.268000 43.71600 63.007000
std 10.305312 7.035104 9.908746 12.840904 10.84246 4.730758
min 33.160000 66.290000 51.590000 47.380000 21.33000 56.700000
25% 41.830000 79.735000 64.855000 58.825000 41.09500 60.302500
50% 54.550000 83.370000 71.435000 70.285000 41.87000 61.695000
75% 57.765000 86.547500 79.067500 79.827500 51.88250 66.475000
max 64.050000 87.900000 81.810000 86.240000 58.70000 71.330000
470 474 478 486 631 A315 \
count 10.000000 10.000000 10.000000 10.00000 10.000000 10.000000
mean 58.684000 41.254000 63.183000 56.27400 68.251000 51.478000
std 9.959611 9.020747 9.716899 8.41883 7.845098 9.170964
min 40.080000 25.880000 45.590000 44.51000 53.750000 32.500000
25% 54.525000 36.352500 58.070000 48.99000 65.420000 47.262500
50% 59.360000 43.690000 63.635000 57.65500 68.035000 52.395000
75% 67.057500 47.505000 68.355000 61.46000 73.667500 56.547500
max 69.740000 52.620000 81.210000 70.76000 79.110000 64.490000
A320 A432 A456 A617 A620 A631 \
count 10.000000 10.0000 10.000000 10.000000 10.000000 10.000000
mean 82.033000 64.0080 51.169000 66.955000 66.514000 72.324000
std 5.684894 10.2900 15.788422 9.981175 8.299787 8.629143
min 71.270000 44.4000 29.520000 53.500000 49.280000 52.590000
25% 78.280000 56.0700 35.130000 59.767500 63.465000 71.135000
50% 83.775000 67.1600 56.655000 66.025000 66.435000 74.985000
75% 85.367500 70.6625 64.732500 73.217500 71.045000 77.050000
max 88.810000 75.6000 67.860000 85.300000 80.760000 80.580000
A680 A685 AMES_1 B319 B484 BSE_1MUDMTNRD \
count 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000
mean 72.914000 83.054000 54.176000 86.418000 41.853000 67.701000
std 8.117826 6.876159 9.818384 4.040747 10.487521 8.916982
min 54.710000 73.120000 43.030000 78.880000 27.570000 51.060000
25% 68.482500 76.600000 47.397500 84.420000 32.915000 62.002500
50% 75.250000 85.010000 52.815000 86.080000 44.150000 70.945000
75% 78.072500 88.612500 58.532500 88.530000 51.305000 74.165000
max 82.310000 91.890000 76.140000 93.220000 53.800000 76.310000
C320 C370 C484 CHERRY_1 D320 F321 \
count 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000
mean 79.876000 50.188000 64.758000 85.877000 86.528000 91.791000
std 5.773008 5.862799 12.429249 4.223272 4.062787 8.004076
min 69.950000 39.420000 44.480000 79.360000 77.650000 69.500000
25% 77.845000 48.805000 59.855000 82.177500 84.495000 92.260000
50% 78.605000 49.930000 66.020000 87.015000 87.255000 94.780000
75% 82.547500 52.472500 71.990000 89.520000 88.965000 95.560000
max 90.490000 60.280000 84.580000 91.110000 91.620000 95.760000
G320 GRIFFIN HARRIS_1 KSHZ06 KTHA01 KTHA03 \
count 10.00000 10.000000 10.000000 10.000000 10.000000 10.000000
mean 69.04100 88.602000 88.284000 52.589000 62.142000 71.468000
std 6.45747 3.473074 4.179767 9.670356 12.685679 9.199348
min 59.50000 84.170000 80.940000 36.440000 38.360000 57.360000
25% 65.85250 85.467500 85.465000 45.820000 54.992500 63.440000
50% 68.77000 88.050000 89.205000 51.425000 60.310000 73.455000
75% 74.01750 92.212500 91.447500 61.507500 70.815000 79.312500
max 79.61000 92.850000 92.960000 65.340000 81.920000 83.130000
LSIN1 LSIN9 MFK_SNQ N484 NFK_SNQ PATTER_3 \
count 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000
mean 31.908000 90.603000 77.921000 72.939000 84.105000 64.782000
std 27.242572 5.666953 6.466331 7.308639 4.463913 6.796064
min 7.130000 83.470000 67.490000 62.310000 74.850000 53.480000
25% 18.445000 85.892500 74.732500 67.520000 82.467500 62.147500
50% 20.670000 91.095000 79.085000 73.465000 85.455000 65.430000
75% 30.267500 91.870000 83.157500 77.655000 86.392500 68.437500
max 86.650000 99.930000 84.860000 85.020000 91.100000 77.120000
RAGING_MTH SFK_SNQ SKYKOMISH SNQDUVALL TOLT_MTH VA42A \
count 10.000000 10.000000 10.000000 10.000000 10.0000 10.000000
mean 75.149000 72.661000 89.989000 76.174000 86.2130 63.128000
std 10.712278 11.370424 3.021771 10.747145 6.0043 13.857216
min 58.930000 57.460000 84.260000 54.180000 75.6500 37.490000
25% 65.512500 63.017500 88.105000 70.692500 82.0850 53.920000
50% 79.290000 72.255000 89.700000 78.155000 87.3550 64.780000
75% 80.977500 82.532500 92.620000 83.535000 90.4900 75.525000
max 89.790000 89.380000 94.140000 88.200000 94.8500 77.550000
VA65A X630
count 10.000000 10.000000
mean 63.805000 49.713000
std 12.156905 9.516245
min 38.810000 39.650000
25% 56.092500 43.462500
50% 68.050000 47.010000
75% 72.567500 53.115000
max 77.560000 72.790000
Let's look at the yearly median for the whole region.
We can see that the yearly WQI median, apart from 58.21 in 2014, is greater than 64. In 2020, yearly WQI median achieved its highest value so far equal 75.205. In general, the region is closer to Low concern water quality than to High concern.
# this is how many records per station is available
ONLY_DF10_REC_COUNT = 10
plot_yearly_medians(grp_df_ = only10_df, rec_count=ONLY_DF10_REC_COUNT,
stations_count_df_= pd.DataFrame(only10_df.count(axis=1), columns = ["StationsCount"]),
period_=f"{DELTA_YEARS['start']}-{DELTA_YEARS['end']}; {len(only10_df.columns)} stations reporting").show()
Now let's look at individual stations.
Here is the summary of some stats for the group sorted in a descending order by median.
# get only10 stats
only10_stats_df = crt_stat_df(only10_df).reset_index()
only10_stats_df = only10_stats_df.rename(columns= {'index' : 'Locator'})
only10_stats_df.index.name = 'index'
only10_stats_df
| Locator | min | max | median | |
|---|---|---|---|---|
| index | ||||
| 0 | F321 | 69.50 | 95.76 | 94.780 |
| 1 | LSIN9 | 83.47 | 99.93 | 91.095 |
| 2 | SKYKOMISH | 84.26 | 94.14 | 89.700 |
| 3 | HARRIS_1 | 80.94 | 92.96 | 89.205 |
| 4 | GRIFFIN | 84.17 | 92.85 | 88.050 |
| 5 | TOLT_MTH | 75.65 | 94.85 | 87.355 |
| 6 | D320 | 77.65 | 91.62 | 87.255 |
| 7 | CHERRY_1 | 79.36 | 91.11 | 87.015 |
| 8 | B319 | 78.88 | 93.22 | 86.080 |
| 9 | NFK_SNQ | 74.85 | 91.10 | 85.455 |
| 10 | A685 | 73.12 | 91.89 | 85.010 |
| 11 | A320 | 71.27 | 88.81 | 83.775 |
| 12 | 438 | 66.29 | 87.90 | 83.370 |
| 13 | RAGING_MTH | 58.93 | 89.79 | 79.290 |
| 14 | MFK_SNQ | 67.49 | 84.86 | 79.085 |
| 15 | C320 | 69.95 | 90.49 | 78.605 |
| 16 | SNQDUVALL | 54.18 | 88.20 | 78.155 |
| 17 | A680 | 54.71 | 82.31 | 75.250 |
| 18 | A631 | 52.59 | 80.58 | 74.985 |
| 19 | N484 | 62.31 | 85.02 | 73.465 |
| 20 | KTHA03 | 57.36 | 83.13 | 73.455 |
| 21 | SFK_SNQ | 57.46 | 89.38 | 72.255 |
| 22 | 440 | 51.59 | 81.81 | 71.435 |
| 23 | BSE_1MUDMTNRD | 51.06 | 76.31 | 70.945 |
| 24 | 442 | 47.38 | 86.24 | 70.285 |
| 25 | G320 | 59.50 | 79.61 | 68.770 |
| 26 | VA65A | 38.81 | 77.56 | 68.050 |
| 27 | 631 | 53.75 | 79.11 | 68.035 |
| 28 | A432 | 44.40 | 75.60 | 67.160 |
| 29 | 3106 | 58.55 | 72.85 | 66.955 |
| 30 | A620 | 49.28 | 80.76 | 66.435 |
| 31 | A617 | 53.50 | 85.30 | 66.025 |
| 32 | C484 | 44.48 | 84.58 | 66.020 |
| 33 | PATTER_3 | 53.48 | 77.12 | 65.430 |
| 34 | VA42A | 37.49 | 77.55 | 64.780 |
| 35 | 430 | 41.95 | 78.68 | 63.740 |
| 36 | 478 | 45.59 | 81.21 | 63.635 |
| 37 | 446 | 56.70 | 71.33 | 61.695 |
| 38 | 321 | 54.33 | 86.67 | 60.365 |
| 39 | KTHA01 | 38.36 | 81.92 | 60.310 |
| 40 | 470 | 40.08 | 69.74 | 59.360 |
| 41 | 486 | 44.51 | 70.76 | 57.655 |
| 42 | A456 | 29.52 | 67.86 | 56.655 |
| 43 | 434 | 33.16 | 64.05 | 54.550 |
| 44 | AMES_1 | 43.03 | 76.14 | 52.815 |
| 45 | A315 | 32.50 | 64.49 | 52.395 |
| 46 | KSHZ06 | 36.44 | 65.34 | 51.425 |
| 47 | C370 | 39.42 | 60.28 | 49.930 |
| 48 | X630 | 39.65 | 72.79 | 47.010 |
| 49 | B484 | 27.57 | 53.80 | 44.150 |
| 50 | 474 | 25.88 | 52.62 | 43.690 |
| 51 | 0450CC | 32.79 | 65.77 | 42.590 |
| 52 | 444 | 21.33 | 58.70 | 41.870 |
| 53 | 322 | 23.69 | 63.99 | 41.200 |
| 54 | 317 | 25.09 | 43.92 | 29.350 |
| 55 | LSIN1 | 7.13 | 86.65 | 20.670 |
plot_grp_stats(only10_df, ONLY_DF10_REC_COUNT, f"{DELTA_YEARS['start']}-{DELTA_YEARS['end']}; {len(only10_df.columns)} stations", tickangle_=45).show()
get_fig_density_scatter(only10_df,loc_df, only10_df.median(axis = 0),
"WQI_median",
f"Stations in King County, WA, with at least {ONLY_DF10_REC_COUNT} records for "
f"the period {DELTA_YEARS['start']}-{DELTA_YEARS['end']}.").show()
l_df= SiteName Locator \
35 Crisp Creek below hatchery intake F321
43 Ravensdale mouth LSIN9
50 South Fork Skykomish at hwy 2 mile marker 47.5 SKYKOMISH
38 Harris Creek at Carnation Duvall Rd HARRIS_1
37 Griffin Creek approx 1.5 mi E of hwy 203 GRIFFIN
52 Tolt River mouth TOLT_MTH
34 Jenkins Creek at Kent Black Diamond Rd SE D320
33 Cherry Creek at NE Cherry Valley Rd CHERRY_1
27 Green River at 212th Way SE B319
46 North Fork Snoqualmie at 428th Ave SE NFK_SNQ
25 Ebright Creek mouth at E Lake Sammamish Pkwy A685
18 Soos Creek mouth below Soos Creek Hatchery A320
5 Cedar River at Bronson Way N 438
48 Raging River mouth near Dike Rd RAGING_MTH
44 Middle Fork Snoqualmie at 428th Ave SE MFK_SNQ
30 Covington Creek at SE Auburn-Black Diamond Rd C320
51 Snoqualmie River at Taylor Landing SNQDUVALL
24 Pine Lake Creek mouth at E Lake Sammamish Pkwy A680
23 Issaquah Creek upstream of hatchery A631
45 Cottage Lake Creek at Tolt Pipeline N484
41 Venema Creek at Pipers Creek Trail KTHA03
49 South Fork Snoqualmie at Snoqualmie Valley Trail SFK_SNQ
6 May Creek mouth at Lake Washington Blvd N 440
29 Boise Creek mouth at SE Mud Mountain Rd BSE_1MUDMTNRD
7 Coal Creek upstream of 119th Ave SE 442
36 Little Soos Creek at SE 272nd St G320
54 Gorsuch Creek mouth near end of SW 167th St VA65A
15 Issaquah Creek mouth at NW Sammamish Rd 631
19 McAleer at Bothell Way NE A432
16 Green River at Starfire Way 3106
22 Idylwood Creek mouth at Idylwood Park A620
21 Lewis Creek mouth at 187th Ave SE A617
32 Bear Creek at NE 95th St C484
47 Patterson Creek mouth at W Snoqualmie River Rd SE PATTER_3
53 Judd Creek near end of SW 225th St VA42A
3 Lyon Creek mouth at Bothell Way NE 430
13 Little Bear mouth near NE 178th St 478
9 Juanita Creek mouth at NE Juanita Dr 446
1 Crisp Creek mouth at SE Green Valley Rd 321
40 Piper's Creek KTHA01
11 Swamp Creek mouth at Bothell Way NE 470
14 Sammamish River at NE Marymoor Way 486
20 Forbes Creek at 108th Ave NE A456
4 Thornton Creek mouth at Sand Point Way NE 434
26 Ames Creek mouth at NE 100th St AMES_1
17 Mill Creek near 68th Ave S and S 262nd St A315
39 Pipers Creek mouth KSHZ06
31 Longfellow Creek at SW Yancy St C370
55 Tibbetts Creek mouth at footbridge X630
28 Evans Creek at NE Union Hill Rd B484
12 North Creek mouth at Sammamish River Trail 474
10 Sammamish River at NE 145th St 0450CC
8 Mercer Slough at 121st Ave SE 444
2 Newaukum Creek mouth at 212th Way SE 322
0 Springbrook Creek mouth at SW 16th St 317
42 Rock Creek mouth LSIN1
WQI_median
35 94.780
43 91.095
50 89.700
38 89.205
37 88.050
52 87.355
34 87.255
33 87.015
27 86.080
46 85.455
25 85.010
18 83.775
5 83.370
48 79.290
44 79.085
30 78.605
51 78.155
24 75.250
23 74.985
45 73.465
41 73.455
49 72.255
6 71.435
29 70.945
7 70.285
36 68.770
54 68.050
15 68.035
19 67.160
16 66.955
22 66.435
21 66.025
32 66.020
47 65.430
53 64.780
3 63.740
13 63.635
9 61.695
1 60.365
40 60.310
11 59.360
14 57.655
20 56.655
4 54.550
26 52.815
17 52.395
39 51.425
31 49.930
55 47.010
28 44.150
12 43.690
10 42.590
8 41.870
2 41.200
0 29.350
42 20.670
""" Counts the number of records satisfying the condition threshold
@param df_ dataframe we want to check the threshold
@param threshold_ >= threshold values will be counted
@param col_
@return number of records satisfied the condition
"""
def count_records(df_, threshold_, col_ = 'median'):
return df_[df_[col_] >= threshold_].shape[0]
def print_stat_counts(df_):
count_all = df_.shape[0]
count_ge80 = count_records(df_, 80)
count_ge40 = count_records(df_, 40)
count_less40 = count_all - count_ge40
print(f" median >= 80: {count_ge80} recs, i.e., {count_ge80/count_all}")
print(f"80 > median >= 40: {count_ge40 - count_ge80} recs, ie., {(count_ge40 - count_ge80)/count_all} records.")
print(f" median < 40: {count_less40} records, i.e., {count_less40/count_all}")
print(f"All records : {count_all} recs.")
print_stat_counts(only10_stats_df)
median >= 80: 13 recs, i.e., 0.23214285714285715
80 > median >= 40: 41 recs, ie., 0.7321428571428571 records.
median < 40: 2 records, i.e., 0.03571428571428571
All records : 56 recs.
80 > median >= 40: 41 recs, ie., 0.7321428571428571 records.
median < 40: 2 records, i.e., 0.03571428571428571
All records : 56 recs.
There are 13 stations ($\approx$ 23%) that recorded WQI with global median in the Low concern zone. They are located in the eastern and southern parts of the region, rather on the outskirts of the region.
There are two stations ($\approx$ 3.6%) with High concern WQIs: locator 317 (Springbrook Creek mouth at SW 16th St) with median WQI=29.35 and LSIN1* (Rock Creek mouth) with WQI median 20.67.
The highest WQI median has been recorded at F321 (Crisp Creek below hatchery intake) and it is 94.78. The second highest WQI median equals 91.095 and belongs to LSIN9 (Ravensdale mouth).
There are only two ($\approx$ 3.6%) High concern locators w.r.t. the WQI median. The majority, i.e. 41 out of 56 records ($\approx$ 73%), fall into the Modern concern category.
It seems that the water quality is better out of centers of bigger, densly populated areas.
plot_yearly_WQI(only10_df, get_stratified_median_df(only10_stats_df), title_ =
"AnnualScore WQI for all locators in Kings County, WA,"
f" with {ONLY_DF10_REC_COUNT} records"
f" for years {DELTA_YEARS['start']}-{DELTA_YEARS['end']}.The legend is sorted in a "
"descending order according to stations' WQI median."
f" {len(only10_df.columns)} stations.").run_server(debug=True)
F321 has the best global median of all locators for a given period, however, in 2016 its WQI scored 69.5 what qualified into the Modern concern catogory. Locators LSIN9, SKYKOMISH, HARRIS_1 and GRIFFIN have been staying in the Low concern category for a given period.
Locator RAGING_MTH has been systematically improving (2020 - 58.93, 2021 - 63.23, 2022 - 87.36). Similarly, MFK_SNQ has started improving from 58.93 in 2020, to 63.23 in 2021, to 84.86 in 2022.
Locator SNQDUVALL has an interesting history Modern concern - Low concern switchbacks. After plumetting to 63.99 in 2020 from 83.94 in 2019, it has been on a steady rise: 75.77 in 2021, and 87.25 in 2022.
Locator A620 shows an optimistic trend of the recovery trajectory: 80.76 in 2020, 49.28 in 2021, and 71.99 in 2022. Similarly A617, C484, VA42A; however, not that dramatic.
Locator 430 has been systematically recovering since 2019, when WQI scored almost the High concern category with 42.5. Over three years it has improved to almost Low concern with WQI 78.68 in 2022.
Locator B484 advanced to Modern concern with 51.97 in 2022 from 27.8 in 2021.
Locator BSE_1MUDMNTRD is troublesome because its WQI has decreased from 76.31 in 2021 to 55.98 in 2022.
Locator SFK_SNQ has also a history of sharp decrease in WQI from 83.36 in 2019 to 57.46 in 2020. However, it has been on a steady rise toward Low concern - 80.05 in 2022.
Locator A685 has recently deteriorated (2021 - 75.32 and 2022 - 73.12). Similarly, locator A320, although improved to 75.39 in 2022 from 71.27 in 2021, it changed its status from 84.73 Low concern in 2020 to 71.27 Modern concern in 2021.
Locator LSIN1 has the worst median WQI. In 2014, it scored the Low concern zone with 86.65; however, the next year it plummeted dramatically to the High concern category with WQI = 11.25. It has never recovered and stays in the High concern category. In 2022 its WQI scored 18.52.
There were two questions raised:
Q1: How has WQI changed over the years? Was WQI better in the past?
Q2: How does WQI change with locators' geography? Is there any pattern such as more densely populated areas have WQI worse than sparsely populated areas? Does any of the areas such as north/south/east/west have cleaner water over the others?
The data timespan was 1970-2023 with 78 stations recording WQI.
In order to answer question 1. we can use the diagrams describing yearly medians:
1970-2023 for 23 stations with at least 46 records available each. They were covering a decent area of the region. If they lack the data it can concern years $\le$ 1978 and/or [2010;2014] inclusive. Based on the analysis of yearly median it stays around middle of the Modern concern ie WQI=60, mostly in the lower half of Modern concern. In recent recent years, i.e., 2017 and above, it has been above 60 (or very close to 60, 2019 median WQI was 59.11). Based on that we can look in the future of our waters in a cautiosly optimistic way.
We can see that the number of stations stabilized in 1979. The period 1970-1978 was time when service at many stations was interrupted. Boxplot values show that 2022 is a winner compared to 1979: max: 93.22 vs. 83.92, Q3: 76.68 vs. 75.21, median: 68.34 vs. 48.34, Q1: 54.62 vs. 34.5, and min. 31.56 vs. 1, 2022 and 1979, respectively. The 1979 distribution is right-skewed, meaning the higher WQI scores (Q3) are more dispersed than Q1 scores. The 2022 distribution is left-skewed meaning that Q3 is less dispersed.
If we take a look at the yearly median of all stations with 10 records in 2013 and 2022, we can see a very little improvement 72.92 from 71.075.However, if we start the comparison in 2014, the yearly median was 58.21 across all stations. The highest median, 75.205, was in 2020. So basically, we can say that we are at the point where we were in 2013. Even the skeweness is similar, i.e., left-skewed, i.e., the data in Q3 is less dispersed, than in Q1. However, in terms of minimums and maximums, the data shows that 2022 is worse than 2013, 95.6 vs. 99.93 and 25.88 vs. 18.52, maximums and minimums respectively.
The conclusion that we might draw is that we are better off compared to the past (1979). Compared to 2013, median WQI has improved 1.845 points, but the minimums and maximums have worsened.
The map for the period 1970-2023 shows that densly populated areas are mostly in the Modern concern category. The three Low concern stations are located in the southern part of the region. And even they are close to Modern concern or High concern. Locator D320 with WQI median 84.845 is relatively close to G320 with 65.945 and locator B319 with median 85.25 is nearby 322 with WQI median 39.33.
The map for 2013-2022 with locators and median for 10 records shows that eastern and south parts of the region have the highest median WQI. However, there are stations that yield Low concern WQIs and stations that yield Modern concern or High concern WQIs and they are located in geographical vicinity. Examples are F321 94.78 and B319 with 86.08 and 321 with 60.365 and 322 with 47.2741, or LSIN9 with 91.095 and LSIN1 with 20.76.
Also 438 (Cedar River at Bronson Way N) in Renton (densely populated area) has outstanding WQI median of 83.37 for 10 years, whereas nearby 317 (*Springbrook Creek mouth at SW 16th St) has WQI 29.35.
From the analysis, it seems that WQI will be likely higher in certain parts of the region, i.e., eastern and southern parts, and the remaining parts are mostly in Modern concern category. However, it seems that it depends more on a particular waterway and in fact more research is required on why Low concern locators are nearby High concern and Modern concern locators.
[1] King County. County, “Water Quality Index Background.” King County Washington State, 2023. Available at: King County WQI